Introduction: Business Problem Statement
1.1 Demand forecasting on Manufacturer, Caloric Segment, Category and Brand
1.2 Demand forecasting on Manufacturer, Caloric Segment and Flavor
2.1 Demand forecasting on Manufacturer, Caloric Segment, Category and Brand
2.2 Demand forecasting on Flavor, Manufacturer, Category, Caloric Segment
2.3 Demand forecasting on Flavor, Non-Swire Manufacturer, Category, and Caloric Segment
3.1 Demand forecasting on Brand, Manufacturer, Category, Caloric Segment in Southern Regions
3.2 Demand forecasting on Package, Caloric Segment, Category and Manufacturer in Southern Regions
3.3 Demand forecasting based on Package, Caloric Segment, Category and Non-Manufacturer
4.1 Demand forecasting on Package, Manufacturer, Category in the Northern region
4.2 Demand forecasting on Package, Manufacturer, Category in the Southern region
4.3 Demand forecasting on Category, Non-Manufacturer, and Package in North Region
4.4 Demand forecasting on Category, Non-Manufacturer, and Package in Southern Region
5.1 Demand forecasting on Category, Manufacturer and Caloric Segment
5.2 Demand forecasting on Flavor, Non-Manufacturer, Caloric Segment
5.3 Demand forecasting based on Package, Manufacturer, Caloric Segment and Brand
6.1 Demand forecasting on Caloric Segment, Category, Manufacturer and Brand
6.2 Demand forecasting on Caloric Segment, Flavor, Non-Manufacturer and Category
7.1 Demand forecasting on Caloric Segment, Category, Manufacturer and Brand
7.2 Demand forecasting on Caloric Segment, Flavor, Non-Manufacturer and Category
Swire Coca-Cola, USA is responsible for the production, sale, and distribution of Coca-Cola and various other beverages across 13 states in the American West. The company is committed to continuously introducing innovative products into the market, and aims to enhance its production planning and management for these products. Forecasting the demand for each innovative product listed helps guarantee efficient resource utilization.
The analytic approach we used for the modeling is:
Identify regular products that closely resemble the specified innovative products and forecast sales by leveraging the sales data of those similar products:
1. Determine the most relevant similar products based on factors such as brand, market category, manufacturer, package type, and/or flavor, matching the specifications of the innovative product.
2. Analyze the weekly sales figures of these similar products.
3. Aggregate their sales data to predict the sales of the innovative product.
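The steps above can be sketched in pandas; the table, column names, and rows below are invented stand-ins for the Swire dataset fields:

```python
import pandas as pd

# Toy stand-in for the market-demand table (invented rows for illustration)
demand = pd.DataFrame({
    'MANUFACTURER':    ['SWIRE-CC', 'SWIRE-CC', 'OTHER'],
    'CALORIC_SEGMENT': ['DIET/LIGHT', 'DIET/LIGHT', 'REGULAR'],
    'CATEGORY':        ['SSD', 'SSD', 'SSD'],
    'BRAND':           ['DIET SMASH', 'DIET SMASH', 'COLA X'],
    'DATE':            pd.to_datetime(['2021-01-02', '2021-01-09', '2021-01-02']),
    'UNIT_SALES':      [100.0, 120.0, 300.0],
})

# Keep only products matching the innovative product's attributes,
# then aggregate weekly unit sales as a proxy for the new product.
mask = (
    (demand['MANUFACTURER'] == 'SWIRE-CC')
    & (demand['CALORIC_SEGMENT'] == 'DIET/LIGHT')
    & (demand['CATEGORY'] == 'SSD')
    & (demand['BRAND'] == 'DIET SMASH')
)
proxy_sales = demand[mask].groupby('DATE')['UNIT_SALES'].sum()
print(proxy_sales)
```

In the notebook itself this filtering step is pushed down into BigQuery instead of pandas, since the full dataset is too large to load directly.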
In this notebook, the modeling provides valuable insight into the sales trends of products across various sub-segments and segment combinations. Additionally, we analyze demographic data alongside product segmentation. The integration of Python, SQL via Google BigQuery, and Tableau provides an insightful analysis, which can further be used for modeling.
#Importing Libraries
!pip install numpy pandas matplotlib statsmodels prophet tensorflow
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
Item Description: Diet Smash Plum 11Small 4One
Caloric Segment: Diet
Market Category: SSD
Manufacturer: Swire-CC
Brand: Diet Smash
Package Type: 11Small 4One
Flavor: 'Plum'
Which 13 weeks of the year would this product perform best in the market?
What is the forecasted demand, in weeks, for those 13 weeks?
We filter on the category 'SSD', manufacturer 'Swire-CC', brand 'Diet Smash', and the Diet/Light caloric segment.
Before building the model to forecast the sales, let's check the data and find similar products in the given dataset. Since the dataset is large, we use Google BigQuery to filter the data based on these requirements and import the filtered data into this notebook.
The dataset provided to us contains no products with the combination of Package Type '11Small 4One' and Flavor 'Plum'. So we first consider the other attributes: Caloric Segment: Diet, Market Category: SSD, Manufacturer: Swire-CC, and Brand: Diet Smash.
# Required authentication for BigQuery.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

auth.authenticate_user()  # Authenticate before creating the client

project = 'spring-swire-ca'  # Project ID inserted based on the query results selected to explore
location = 'US'              # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
job = client.get_job('bquxjob_41b06c35_18e92c55ff5') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,
       SUM(UNIT_SALES) AS UNIT_SALES,
       SUM(DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand`
WHERE MANUFACTURER = 'SWIRE-CC'
  AND CALORIC_SEGMENT = 'DIET/LIGHT'
  AND CATEGORY = 'SSD'
  AND BRAND = 'DIET SMASH'
GROUP BY DATE;
results = job.to_dataframe()
results
| DATE | UNIT_SALES | DOLLAR_SALES | |
|---|---|---|---|
| 0 | 2021-10-16 | 4629.0 | 13467.15 |
| 1 | 2021-04-24 | 2061.0 | 3308.90 |
| 2 | 2022-01-01 | 4601.0 | 9645.27 |
| 3 | 2021-04-03 | 2291.0 | 3620.47 |
| 4 | 2021-11-06 | 4228.0 | 9983.71 |
| ... | ... | ... | ... |
| 142 | 2021-11-27 | 4729.0 | 10618.19 |
| 143 | 2022-01-29 | 4442.0 | 14286.74 |
| 144 | 2021-01-23 | 2684.0 | 5300.92 |
| 145 | 2022-11-19 | 1747.0 | 10371.57 |
| 146 | 2022-07-23 | 2136.0 | 11672.57 |
147 rows × 3 columns
We pull the data from Google BigQuery into this notebook, and next we modify the DataFrame 'results' by converting the 'DATE' column to datetime format and deriving year, month, and week columns from it.
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extracting relevant features for forecasting (copy to avoid SettingWithCopyWarning)
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
# Adding additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR | |
|---|---|---|---|---|---|---|
| 0 | 2021-10-16 | 4629.0 | 13467.15 | 2021 | 10 | 41 |
| 1 | 2021-04-24 | 2061.0 | 3308.90 | 2021 | 4 | 16 |
| 2 | 2022-01-01 | 4601.0 | 9645.27 | 2022 | 1 | 52 |
| 3 | 2021-04-03 | 2291.0 | 3620.47 | 2021 | 4 | 13 |
| 4 | 2021-11-06 | 4228.0 | 9983.71 | 2021 | 11 | 44 |
| ... | ... | ... | ... | ... | ... | ... |
| 142 | 2021-11-27 | 4729.0 | 10618.19 | 2021 | 11 | 47 |
| 143 | 2022-01-29 | 4442.0 | 14286.74 | 2022 | 1 | 4 |
| 144 | 2021-01-23 | 2684.0 | 5300.92 | 2021 | 1 | 3 |
| 145 | 2022-11-19 | 1747.0 | 10371.57 | 2022 | 11 | 46 |
| 146 | 2022-07-23 | 2136.0 | 11672.57 | 2022 | 7 | 29 |
147 rows × 6 columns
We follow this same pattern throughout the notebook: importing data from Google BigQuery into this notebook and extracting year, month, and week from the 'DATE' column.
Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.
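The "weighted average of past observations" idea is easiest to see in simple exponential smoothing, where each smoothed value blends the newest observation with the previous smoothed value; the Holt-Winters model used below adds trend and seasonal components on top of this recursion. A minimal sketch on made-up numbers:

```python
import numpy as np

def simple_exp_smooth(series, alpha):
    """Return smoothed values: s_t = alpha * y_t + (1 - alpha) * s_{t-1}."""
    smoothed = [series[0]]  # initialize with the first observation
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return np.array(smoothed)

sales = [100, 110, 105, 120, 115]
print(simple_exp_smooth(sales, alpha=0.5))  # [100.  105.  105.  112.5 113.75]
```

A larger `alpha` weights recent observations more heavily; older observations decay geometrically.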
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Ensuring the DATE column is in datetime format and setting as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Sorting the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)
last_date = forecast_features.index.max()
# Preparing the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W-SUN')[1:]
# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index
# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index
# Function to find the best 13 weeks
def find_best_13_weeks(forecast):
    # Rolling sum over a window of 13 weeks
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)  # 13 weeks including the end week
    return best_period_start, best_period_end
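The rolling-sum search can be sanity-checked on a toy weekly series: the 13-week sum peaks at the first window that fully covers the strongest run of weeks (dates and values below are invented):

```python
import pandas as pd
import numpy as np

# 26 weekly points with a five-week spike in the middle
idx = pd.date_range('2024-01-07', periods=26, freq='W-SUN')
values = np.ones(26)
values[10:15] = 5.0  # the spike
toy = pd.Series(values, index=idx)

# Rolling 13-week sum; idxmax gives the end of the best window
rolling = toy.rolling(window=13, min_periods=13).sum()
end = rolling.idxmax()
start = end - pd.DateOffset(weeks=12)
print(start.date(), end.date(), rolling.max())
```

The maximum window here sums five 5s plus eight 1s (33.0), and `idxmax` returns the earliest week at which that maximum is reached.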
# Find the best 13 weeks for the UNIT_SALES
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for the DOLLAR_SALES
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)
# Plotting function with adjustment for negative values
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    # Ensure no negative values in the forecast
    forecast_positive = forecast.clip(lower=0)
    plt.plot(forecast_positive.index, forecast_positive, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()
# Plotting the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')
From the plots we can see that the best 13 weeks run from November to January for unit sales and from June to September for dollar sales.
# Defining the function to find the best 13 weeks (also returning the window total)
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()
# Finding the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)
# Finding the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)
# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")
exp_forecast.index.freq = 'W-SUN' # Here we are assuming forecasts start on Sundays
exp_forecast_dollar.index.freq = 'W-SUN'
# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]
# Printing out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)
print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)
# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()
print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)
print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2023-11-05 and end on 2024-01-28, with total sales: 64186.768234957635
Best 13 weeks for dollar sales start on 2024-06-16 and end on 2024-09-08, with total sales: 580059.770161476

Best 13 weeks for Unit Sales:
2023-11-05    5687.332771
2023-11-12    5692.474623
2023-11-19    5510.433734
2023-11-26    5385.756328
2023-12-03    5346.653107
2023-12-10    4906.951640
2023-12-17    4634.071058
2023-12-24    4635.620448
2023-12-31    4851.530769
2024-01-07    4798.997773
2024-01-14    4211.564092
2024-01-21    3896.723106
2024-01-28    4628.658785
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2024-06-16    42130.221537
2024-06-23    42156.532409
2024-06-30    44000.197305
2024-07-07    43192.075601
2024-07-14    43080.532361
2024-07-21    45498.123601
2024-07-28    46176.066906
2024-08-04    48598.681192
2024-08-11    48579.299690
2024-08-18    44776.838333
2024-08-25    45696.473122
2024-09-01    43551.991290
2024-09-08    42622.736814
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['June', 'July', 'August', 'September'], dtype='object')
The total forecasted sales over the best 13 weeks are about 64,187 units, and the corresponding dollar sales are about $580,060.
Let's evaluate the performance of the model using Mean Absolute Error (MAE) and Mean Squared Error (MSE).
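As a quick reminder of what these metrics measure, on toy numbers:

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = [100, 200, 300]
predicted = [110, 190, 330]
# MAE: average absolute deviation; MSE: average squared deviation
print(mean_absolute_error(actual, predicted))  # (10 + 10 + 30) / 3
print(mean_squared_error(actual, predicted))   # (100 + 100 + 900) / 3
```

MSE penalizes large errors much more heavily than MAE, which is why the two metrics can rank models differently.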
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]
# Defining the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}
# Fitting the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()
# Generating forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))
# Calculating MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)
# Repeating the process for the DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)
# Printing the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 4169.322570662927, MSE: 21871252.498852786
DOLLAR_SALES - MAE: 9752.579068279909, MSE: 123352246.47961982
We can see that the MAE for the unit sales model is about 4,169 and for the dollar sales model about 9,753, both quite high. So let's try some other models and decide which performs best.
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
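Prophet expects its input as a two-column DataFrame named `ds` (datestamp) and `y` (value), so the preparation step is plain pandas renaming (invented data below):

```python
import pandas as pd

# Raw data with the dataset's column names
raw = pd.DataFrame({
    'DATE': pd.to_datetime(['2021-01-02', '2021-01-09']),
    'UNIT_SALES': [100.0, 120.0],
})

# Rename to Prophet's expected ds/y convention
df_prophet = raw.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
print(list(df_prophet.columns))  # ['ds', 'y']
```

Each series to be forecast (unit sales, dollar sales) needs its own `ds`/`y` frame and its own fitted model.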
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Preparing the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['UNIT_SALES']].reset_index().rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DOLLAR_SALES']].reset_index().rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Fitting the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)
# Fitting the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)
# Creating a future dataframe for one year and making predictions
future = prophet_model_unit.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future)
forecast_dollar = prophet_model_dollar.predict(future)
# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast):
    # Daily forecast, so 13 weeks = 91 days; center the rolling window
    forecast['rolling_sum'] = forecast['yhat'].rolling(window=91, min_periods=1, center=True).sum()
    best_period_idx = forecast['rolling_sum'].idxmax()
    # Clamp the window edges so iloc stays within bounds
    best_period_start = forecast.iloc[max(best_period_idx - 91 // 2, 0)]['ds']
    best_period_end = forecast.iloc[min(best_period_idx + 91 // 2, len(forecast) - 1)]['ds']
    return best_period_start, best_period_end
# Finding the best 13 weeks for the UNIT_SALES and DOLLAR_SALES
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar)
# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()
# Plotting the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
INFO:prophet:Disabling daily seasonality. Run prophet with daily_seasonality=True to override this.
From this plot we can see that the best 13 weeks for unit sales are July to September and for dollar sales August to October.
Now let's evaluate the model performance using the MAE and MSE.
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # 80/20 split
train = forecast_features.iloc[:split_point].copy()
test = forecast_features.iloc[split_point:].copy()
# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)
train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)
# Fitting the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']])
# Generating forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)
# Calculating MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
# Repeating the process for DOLLAR_SALES (this column also needs to be renamed to 'y' for Prophet)
train_dollar = train[['ds', 'DOLLAR_SALES']].rename(columns={'DOLLAR_SALES': 'y'})
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train_dollar)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
# Printing the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 2583.4325462688034, MSE: 9513128.831265468
DOLLAR_SALES - MAE: 26826.278910410318, MSE: 815968070.3461262
Compared with exponential smoothing, Prophet reduces the unit sales MAE (2,583 vs. 4,169), but the dollar sales error is actually higher (26,826 vs. 9,753).
Let's also try with the SARIMA time series model.
SARIMAX stands for Seasonal Autoregressive Integrated Moving Average with eXogenous variables. It is a time series model that can handle external effects. SARIMAX extends the ARIMA model, which combines an autoregressive term (AR), differencing (the "integrated" part, I), and a moving-average term (MA), and adds seasonal counterparts of each.
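The "integrated" part of the name corresponds to differencing: `d` in `order=(p, d, q)` is ordinary differencing and `D` in `seasonal_order=(P, D, Q, s)` is differencing at the seasonal lag `s` (52 for this weekly data). A toy illustration with a pretend period of 3:

```python
import pandas as pd

y = pd.Series([10, 12, 15, 11, 13, 16, 12, 14])
season = 3  # toy seasonal period (the models below use 52 for weekly data)

first_diff = y.diff()           # the 'd' in order=(p, d, q)
seasonal_diff = y.diff(season)  # the 'D' in seasonal_order=(P, D, Q, s)
print(first_diff.tolist())
print(seasonal_diff.tolist())
```

Here the seasonal difference is a constant 1.0, showing how differencing at the right lag removes a repeating pattern and leaves a stationary remainder for the AR and MA terms to model.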
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt
# Sorting the DataFrame by the datetime index just to be sure
forecast_features.sort_index(inplace=True)
# Defining the SARIMA model for UNIT_SALES
sarima_model_unit = SARIMAX(forecast_features['UNIT_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
sarima_result_unit = sarima_model_unit.fit()
# Defining the SARIMA model for DOLLAR_SALES
sarima_model_dollar = SARIMAX(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
sarima_result_dollar = sarima_model_dollar.fit()
# Defining the date range for the next year after the last date in the dataset
last_date = forecast_features.index.max()
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
# Forecasting the next 52 periods (assuming weekly data)
sarima_forecast_unit = sarima_result_unit.get_forecast(steps=52).predicted_mean
sarima_forecast_dollar = sarima_result_dollar.get_forecast(steps=52).predicted_mean
# Converting forecasts to pandas Series with a DateTimeIndex
sarima_forecast_unit = pd.Series(sarima_forecast_unit.values, index=forecast_dates)
sarima_forecast_dollar = pd.Series(sarima_forecast_dollar.values, index=forecast_dates)
# Checking if the rolling sum calculation is possible
rolling_sum = sarima_forecast_unit.rolling(window=13, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
if pd.notnull(best_period_end):
    best_period_start = best_period_end - pd.DateOffset(weeks=12)

    # Plotting SARIMA forecast with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['UNIT_SALES'], label='Actual Unit Sales', color='blue')
    plt.plot(sarima_forecast_unit.index, sarima_forecast_unit, label='SARIMA Forecast', color='red')
    plt.axvspan(best_period_start, best_period_end, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('SARIMA Forecast for Unit Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Unit Sales')
    plt.legend()
    plt.show()
else:
    print("No best period found.")
From the plot we can say that the best 13 weeks for the unit sales are from August to October.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt
# Defining the SARIMA model for DOLLAR_SALES
sarima_model_dollar = SARIMAX(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
sarima_result_dollar = sarima_model_dollar.fit()
# Forecasting the next 52 periods (assuming weekly data)
sarima_forecast_dollar = sarima_result_dollar.get_forecast(steps=52).predicted_mean
# Converting forecast to pandas Series with a DateTimeIndex
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
sarima_forecast_dollar = pd.Series(sarima_forecast_dollar.values, index=forecast_dates)
# Calculating the rolling sum over 13-week periods to find the best period for dollar sales
rolling_sum_dollar = sarima_forecast_dollar.rolling(window=13, min_periods=1).sum()
best_period_end_dollar = rolling_sum_dollar.idxmax()
if pd.notnull(best_period_end_dollar):
    best_period_start_dollar = best_period_end_dollar - pd.DateOffset(weeks=12)

    # Plotting the SARIMA forecast for DOLLAR_SALES with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['DOLLAR_SALES'], label='Actual Dollar Sales', color='blue')
    plt.plot(sarima_forecast_dollar.index, sarima_forecast_dollar, label='SARIMA Forecast', color='red')
    plt.axvspan(best_period_start_dollar, best_period_end_dollar, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('SARIMA Forecast for Dollar Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Dollar Sales')
    plt.legend()
    plt.show()
else:
    print("No best period found.")
From the plot we can say that the best 13 weeks for the dollar sales are from August to October.
# Function to calculate the best 13-week period for any given forecast series
def calculate_best_13_weeks(forecast_series):
    rolling_sum = forecast_series.rolling(window=13, min_periods=1).sum()
    max_sum_index = rolling_sum.idxmax()
    max_sum_value = rolling_sum.max()
    start_of_best_period = max_sum_index - pd.DateOffset(weeks=12)  # 13 weeks including the end week
    return start_of_best_period, max_sum_index, max_sum_value
# Calculating for Unit Sales
best_start_unit, best_end_unit, best_sales_unit = calculate_best_13_weeks(sarima_forecast_unit)
print(f"Best 13 Weeks for Unit Sales: {best_start_unit.date()} to {best_end_unit.date()}, Total Sales: {best_sales_unit}")
# Calculating for Dollar Sales
best_start_dollar, best_end_dollar, best_sales_dollar = calculate_best_13_weeks(sarima_forecast_dollar)
print(f"Best 13 Weeks for Dollar Sales: {best_start_dollar.date()} to {best_end_dollar.date()}, Total Sales: {best_sales_dollar}")
Best 13 Weeks for Unit Sales: 2024-07-21 to 2024-10-13, Total Sales: 98739.01788092876
Best 13 Weeks for Dollar Sales: 2024-07-28 to 2024-10-20, Total Sales: 737794.6046762894
The forecasted total for the best 13-week period is about 98,739 units, with total revenue of about $737,795. Let's look at the performance of the model.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Function for walk-forward validation for time series
def sarima_walk_forward_validation(data, order, sorder, start_train, end_test, step=1):
    history = data.iloc[:start_train].tolist()
    predictions = []
    actual = []
    # Walk forward over the time steps in the test window
    for i in range(start_train, end_test, step):
        model = SARIMAX(history, order=order, seasonal_order=sorder,
                        enforce_stationarity=False, enforce_invertibility=False)
        model_fit = model.fit(disp=False)
        yhat = model_fit.forecast()[0]
        predictions.append(yhat)
        actual.append(data.iloc[i])   # positional access avoids datetime-label lookup issues
        history.append(data.iloc[i])
    mse = mean_squared_error(actual, predictions)
    mae = mean_absolute_error(actual, predictions)
    return mse, mae, predictions
order = (1, 1, 1)
seasonal_order = (1, 1, 1, 52)
# Adjusting these values based on the size of the dataset
start_train = int(len(forecast_features) * 0.7) # Starting the training with 70% of the dataset
end_test = len(forecast_features)
unit_sales_data = forecast_features['UNIT_SALES']
dollar_sales_data = forecast_features['DOLLAR_SALES']
mse_unit, mae_unit, predictions_unit = sarima_walk_forward_validation(unit_sales_data, order, seasonal_order, start_train, end_test)
mse_dollar, mae_dollar, predictions_dollar = sarima_walk_forward_validation(dollar_sales_data, order, seasonal_order, start_train, end_test)
# Printing the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/sarimax.py:866: UserWarning: Too few observations to estimate starting parameters for seasonal ARMA. All parameters except for variances will be set to zeros.
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/mlemodel.py:1234: RuntimeWarning: divide by zero encountered in divide
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/mlemodel.py:1234: RuntimeWarning: invalid value encountered in divide
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
(the same warnings repeat for each walk-forward refit; repeats omitted)
UNIT_SALES: MSE=12138087.243539423, MAE=1080.6169480631572
DOLLAR_SALES: MSE=14950588.254797772, MAE=2971.096624988875
print(f"UNIT_SALES: MSE={mse_unit}, MAE={mae_unit}")
print(f"DOLLAR_SALES: MSE={mse_dollar}, MAE={mae_dollar}")
UNIT_SALES: MSE=9513128.831265468, MAE=2583.4325462688034
DOLLAR_SALES: MSE=815968070.3461262, MAE=26826.278910410318
The MSE and MAE values decreased compared to the Prophet and exponential smoothing models.
Across the three models, the SARIMA model identifies the best sales window as August 21st to October 10th, with total forecasted sales of 98,740 units and revenue of 737,794 dollars.
Now we try another set of filters: the flavor 'Plum', manufacturer Swire-CC, and the Diet/Light caloric segment.
Before building the forecasting model, let's check the data and find similar products in the given dataset. Since the dataset is large, we use Google BigQuery to filter the data based on the requirements and import the filtered data into this notebook.
The dataset provided to us contains no combination with Package Type '11Small 4One', so we first consider the remaining attributes: Caloric Segment: Diet, Market Category: SSD, Manufacturer: Swire-CC, and Flavor: 'Plum'.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
job = client.get_job('bquxjob_4684b0b3_18e96e150fb') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES FROM `swirecc.fact_market_demand` WHERE ITEM LIKE '%PLUM%' AND MANUFACTURER = 'SWIRE-CC' AND CALORIC_SEGMENT = 'DIET/LIGHT' GROUP BY DATE;
job = client.get_job('bquxjob_4684b0b3_18e96e150fb') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| | DATE | UNIT_SALES | DOLLAR_SALES |
|---|---|---|---|
| 0 | 2022-02-26 | 860.0 | 805.31 |
| 1 | 2021-12-25 | 1276.0 | 1191.99 |
| 2 | 2020-12-12 | 1625.0 | 1505.14 |
| 3 | 2021-06-26 | 1635.0 | 1547.28 |
| 4 | 2022-04-16 | 1168.0 | 1060.76 |
| ... | ... | ... | ... |
| 135 | 2021-07-03 | 1617.0 | 1524.87 |
| 136 | 2023-05-27 | 955.0 | 1078.13 |
| 137 | 2022-08-13 | 1209.0 | 1316.36 |
| 138 | 2023-03-25 | 960.0 | 1109.77 |
| 139 | 2022-12-17 | 881.0 | 983.93 |
140 rows × 3 columns
We pull the data from Google BigQuery into this notebook, then modify the 'results' DataFrame by converting the 'DATE' column to datetime format and deriving year, month, and week features from it.
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extracting relevant features for forecasting (copy to avoid SettingWithCopyWarning below)
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
# Adding additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| | DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR |
|---|---|---|---|---|---|---|
| 0 | 2022-02-26 | 860.0 | 805.31 | 2022 | 2 | 8 |
| 1 | 2021-12-25 | 1276.0 | 1191.99 | 2021 | 12 | 51 |
| 2 | 2020-12-12 | 1625.0 | 1505.14 | 2020 | 12 | 50 |
| 3 | 2021-06-26 | 1635.0 | 1547.28 | 2021 | 6 | 25 |
| 4 | 2022-04-16 | 1168.0 | 1060.76 | 2022 | 4 | 15 |
| ... | ... | ... | ... | ... | ... | ... |
| 135 | 2021-07-03 | 1617.0 | 1524.87 | 2021 | 7 | 26 |
| 136 | 2023-05-27 | 955.0 | 1078.13 | 2023 | 5 | 21 |
| 137 | 2022-08-13 | 1209.0 | 1316.36 | 2022 | 8 | 32 |
| 138 | 2023-03-25 | 960.0 | 1109.77 | 2023 | 3 | 12 |
| 139 | 2022-12-17 | 881.0 | 983.93 | 2022 | 12 | 50 |
140 rows × 6 columns
We follow this same pattern throughout the notebook: import the filtered dataset from Google BigQuery and extract year, month, and week features from the 'DATE' column.
Here we get 140 rows of filtered data from Google BigQuery, with the Year, Month, and Week columns added.
Now let's do the modeling.
Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.
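Before fitting the full Holt-Winters model below (which adds trend and seasonal components), the weighted-average idea can be illustrated with the basic exponential-smoothing recursion. This is a minimal sketch with made-up values, not part of the Swire pipeline:

```python
# Minimal sketch of the core exponential-smoothing recursion: each smoothed
# value is a weighted average of the newest observation and the previous
# smoothed value. The alpha and the toy series are illustrative only.
def simple_exp_smooth(series, alpha=0.5):
    smoothed = [series[0]]  # initialize with the first observation
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

values = [100, 120, 110, 130]
print(simple_exp_smooth(values, alpha=0.5))  # → [100, 110.0, 110.0, 120.0]
```

A larger alpha weights recent observations more heavily; the Holt-Winters model used next estimates such smoothing parameters from the data rather than fixing them by hand.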
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Ensuring the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Sorting the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)
# Defining the last date in the DataFrame
last_date = forecast_features.index.max()
# Preparing the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W-SUN')[1:]
# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
forecast_features['UNIT_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index
# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
forecast_features['DOLLAR_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index
# Function to find out the best 13 weeks
def find_best_13_weeks(forecast):
rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
best_period_start = best_period_end - pd.DateOffset(weeks=12) # 13 weeks include the end week
return best_period_start, best_period_end
# Find the best 13 weeks for the unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for the dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)
# Plotting function
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
plt.figure(figsize=(14, 7))
forecast = forecast.clip(lower=0) # Ensure no negative values in the forecast
plt.plot(forecast.index, forecast, label='Forecast')
plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
plt.title(title)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
# Plotting the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')
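The window selection inside `find_best_13_weeks` can be sanity-checked on a toy weekly series (window shortened from 13 to 4 weeks here; the dates and values are made up). The key detail is that a pandas rolling sum is right-aligned, so `idxmax` returns the end of the best window and the start is recovered by stepping back:

```python
import pandas as pd

# Toy weekly series with an obvious high-sales stretch in the middle.
idx = pd.date_range('2023-01-01', periods=8, freq='W-SUN')
forecast = pd.Series([1, 1, 5, 5, 5, 5, 1, 1], index=idx, dtype=float)

rolling_sum = forecast.rolling(window=4, min_periods=1).sum()
best_end = rolling_sum.idxmax()                 # end of the best window (right-aligned)
best_start = best_end - pd.DateOffset(weeks=3)  # 4 weeks including the end week
print(best_start.date(), best_end.date())       # → 2023-01-15 2023-02-05
```

Note that `min_periods=1` lets the earliest windows sum fewer than the full number of weeks, which is harmless here since partial windows can only have smaller sums for non-negative forecasts.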
From the plot, the best 13 weeks for sales run from November to the end of January, and sales drop sharply during the second half of the year.
# Defining the function to find the best 13 weeks
def find_best_13_weeks(forecast):
rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
best_period_start = best_period_end - pd.DateOffset(weeks=12)
return best_period_start, best_period_end, rolling_sum.max()
# Finding the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)
# Finding the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)
# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")
exp_forecast.index.freq = 'W-SUN' # Assuming our forecasts start on Sundays
exp_forecast_dollar.index.freq = 'W-SUN'
# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]
# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)
print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)
# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()
print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)
print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2023-11-05 and end on 2024-01-28, with total sales: 6506.903198230882
Best 13 weeks for dollar sales start on 2023-11-05 and end on 2024-01-28, with total sales: 7724.323341631012

Best 13 weeks for Unit Sales:
2023-11-05    870.166374
2023-11-12    824.413819
2023-11-19    680.349468
2023-11-26    769.461487
2023-12-03    669.312831
2023-12-10    561.813252
2023-12-17    446.329587
2023-12-24    410.856229
2023-12-31    398.273982
2024-01-07    327.720387
2024-01-14    271.632940
2024-01-21    154.051637
2024-01-28    122.521204
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2023-11-05    871.962416
2023-11-12    880.221959
2023-11-19    741.583458
2023-11-26    884.959862
2023-12-03    773.559473
2023-12-10    683.116360
2023-12-17    570.747949
2023-12-24    527.911276
2023-12-31    550.818957
2024-01-07    366.850624
2024-01-14    327.699192
2024-01-21    256.345834
2024-01-28    288.545981
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')
The best 13 weeks run from November 5th to January 28th for both unit sales and dollar sales. Total forecasted unit sales over this period are about 6,506 units, with revenue of about 7,724 dollars.
Let's evaluate the performance of the model.
# Splitting the data into train and test sets
from sklearn.metrics import mean_absolute_error, mean_squared_error
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]
# Defining the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}
# Fitting the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()
# Generating forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))
# Calculating MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)
# Repeating the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)
# Printing the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 1037.5271164522487, MSE: 1516515.6249574588 DOLLAR_SALES - MAE: 951.8799506669977, MSE: 1313122.1411227155
Here the MAE is about 1,037 for unit sales and 952 for dollar sales, while the MSE is about 1,516,516 for unit sales and 1,313,122 for dollar sales, which is quite high.
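For reference, the two metrics reported throughout this notebook can be computed by hand; the actual/predicted values below are illustrative, not taken from the Swire data:

```python
# Hand-rolled versions of the error metrics used in this notebook.
actual    = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

# MAE: average absolute error. MSE: average squared error, which punishes
# large misses far more heavily -- one reason the MSE figures above look
# enormous next to the MAE figures.
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

print(f"MAE={mae:.2f}, MSE={mse:.2f}")  # same results as sklearn's mean_absolute_error / mean_squared_error
```

Because MSE squares each error, it is in squared units of the target, which is why it is not directly comparable to MAE in magnitude.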
Let's look at the Prophet time series model.
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
# Convert the 'DATE' column to datetime and sort by date
# (reset the index first in case 'DATE' was set as the index in an earlier cell)
if 'DATE' not in forecast_features.columns:
    forecast_features.reset_index(inplace=True)
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)
# Preparing the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)
# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)
# Create a weekly future dataframe for one year and make predictions
# (the data is weekly, so forecast 52 weekly periods rather than 365 daily rows;
# otherwise the 13-row rolling window below would cover 13 days, not 13 weeks)
future_unit = prophet_model_unit.make_future_dataframe(periods=52, freq='W')
future_dollar = prophet_model_dollar.make_future_dataframe(periods=52, freq='W')
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)
# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()
# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date].copy()
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    # The rolling sum is right-aligned, so idxmax marks the END of the best window
    best_period_end = forecast_future.loc[best_period_idx, 'ds']
    best_period_start = best_period_end - pd.DateOffset(weeks=12)  # 13 weeks include the end week
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)
# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
fig = model.plot(forecast)
plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
plt.title(title)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
(Prophet and cmdstanpy logging output from the two model fits omitted: daily seasonality is disabled for both models, and each Stan optimization chain starts and finishes processing.)
The best 13 weeks for this product run from August to November for both unit sales and dollar sales.
Let's evaluate the performance of the model.
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point].copy() # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy() # Make a copy to avoid modifying the original DataFrame
# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)
train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)
# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']]) # Ensure 'ds' and 'y' columns are selected
# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)
# Now we calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
# Repeating the process for DOLLAR_SALES: fit on the dollar-sales series,
# not the unit-sales 'y' column created above
train_dollar = train[['ds', 'DOLLAR_SALES']].rename(columns={'DOLLAR_SALES': 'y'})
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train_dollar)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
(Prophet and cmdstanpy logging output from the two model fits omitted, as above.)
# Printing the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 274.3332717188195, MSE: 93934.55080369273 DOLLAR_SALES - MAE: 375.8133333053362, MSE: 162370.21758420693
The MAE and MSE values decreased compared to the exponential smoothing model.
Let's use the SARIMA model and evaluate it.
SARIMAX stands for Seasonal AutoRegressive Integrated Moving Average with eXogenous variables. It is a time series model that can incorporate external effects. SARIMAX extends the ARIMA model, whose core components are the autoregressive (AR) term, the differencing (I) term, and the moving-average (MA) term.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt
# Ensure the DATE column is in datetime format and set as the DataFrame's index
# (skip if 'DATE' is already the index from an earlier cell)
if 'DATE' in forecast_features.columns:
    forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
    forecast_features.set_index('DATE', inplace=True)
# Sort the DataFrame by the datetime index just to be sure
forecast_features.sort_index(inplace=True)
# Define the SARIMA model for UNIT_SALES
sarima_model_unit = SARIMAX(forecast_features['UNIT_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
sarima_result_unit = sarima_model_unit.fit()
# Define the SARIMA model for DOLLAR_SALES
sarima_model_dollar = SARIMAX(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
sarima_result_dollar = sarima_model_dollar.fit()
# Define the date range for the next year after the last date in the dataset
last_date = forecast_features.index.max()
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
# Forecast the next 52 periods (assuming weekly data)
sarima_forecast_unit = sarima_result_unit.get_forecast(steps=52).predicted_mean
sarima_forecast_dollar = sarima_result_dollar.get_forecast(steps=52).predicted_mean
# Convert forecasts to pandas Series with a DateTimeIndex
sarima_forecast_unit = pd.Series(sarima_forecast_unit.values, index=forecast_dates)
sarima_forecast_dollar = pd.Series(sarima_forecast_dollar.values, index=forecast_dates)
# Check if rolling sum calculation is possible
rolling_sum = sarima_forecast_unit.rolling(window=13, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
if pd.notnull(best_period_end):
best_period_start = best_period_end - pd.DateOffset(weeks=12)
# Plotting SARIMA forecast with the best 13 weeks highlighted
plt.figure(figsize=(10, 6))
plt.plot(forecast_features.index, forecast_features['UNIT_SALES'], label='Actual Unit Sales', color='blue')
plt.plot(sarima_forecast_unit.index, sarima_forecast_unit, label='SARIMA Forecast', color='red')
plt.axvspan(best_period_start, best_period_end, color='yellow', alpha=0.3, label='Best 13 Weeks')
plt.title('SARIMA Forecast for Unit Sales with Best 13 Weeks Highlighted')
plt.xlabel('Date')
plt.ylabel('Unit Sales')
plt.legend()
plt.show()
else:
print("No best 13-week period could be identified.")
The model's best 13 weeks for unit sales run from November to January.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
import matplotlib.pyplot as plt
# Define the SARIMA model for DOLLAR_SALES
sarima_model_dollar = SARIMAX(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
sarima_result_dollar = sarima_model_dollar.fit()
# Forecast the next 52 periods (assuming weekly data)
sarima_forecast_dollar = sarima_result_dollar.get_forecast(steps=52).predicted_mean
# Convert forecast to pandas Series with a DateTimeIndex
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
sarima_forecast_dollar = pd.Series(sarima_forecast_dollar.values, index=forecast_dates)
# Calculate the rolling sum over 13-week periods to find the best period for dollar sales
rolling_sum_dollar = sarima_forecast_dollar.rolling(window=13, min_periods=1).sum()
best_period_end_dollar = rolling_sum_dollar.idxmax()
if pd.notnull(best_period_end_dollar):
best_period_start_dollar = best_period_end_dollar - pd.DateOffset(weeks=12)
# Plot the SARIMA forecast for DOLLAR_SALES with the best 13 weeks highlighted
plt.figure(figsize=(10, 6))
plt.plot(forecast_features.index, forecast_features['DOLLAR_SALES'], label='Actual Dollar Sales', color='blue')
plt.plot(sarima_forecast_dollar.index, sarima_forecast_dollar, label='SARIMA Forecast', color='red')
plt.axvspan(best_period_start_dollar, best_period_end_dollar, color='yellow', alpha=0.3, label='Best 13 Weeks')
plt.title('SARIMA Forecast for Dollar Sales with Best 13 Weeks Highlighted')
plt.xlabel('Date')
plt.ylabel('Dollar Sales')
plt.legend()
plt.show()
else:
print("No best 13-week period could be identified.")
The model's best 13 weeks for dollar sales run from November to January.
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Function for walk-forward validation on a time series
def sarima_walk_forward_validation(data, order, sorder, start_train, end_test, step=1):
    history = data.iloc[:start_train].tolist()
    predictions = []
    actual = []
    # Walk forward over the time steps in the test window
    for i in range(start_train, end_test, step):
        model = SARIMAX(history, order=order, seasonal_order=sorder, enforce_stationarity=False, enforce_invertibility=False)
        model_fit = model.fit(disp=False)
        yhat = model_fit.forecast()[0]
        predictions.append(yhat)
        actual.append(data.iloc[i])
        history.append(data.iloc[i])  # fold the new observation into the history
    mse = mean_squared_error(actual, predictions)
    mae = mean_absolute_error(actual, predictions)
    return mse, mae, predictions
order = (1, 1, 1)
seasonal_order = (1, 1, 1, 52)
# Adjusting these values based on the size of your dataset
start_train = int(len(forecast_features) * 0.7)
end_test = len(forecast_features)
unit_sales_data = forecast_features['UNIT_SALES']
dollar_sales_data = forecast_features['DOLLAR_SALES']
mse_unit, mae_unit, predictions_unit = sarima_walk_forward_validation(unit_sales_data, order, seasonal_order, start_train, end_test)
mse_dollar, mae_dollar, predictions_dollar = sarima_walk_forward_validation(dollar_sales_data, order, seasonal_order, start_train, end_test)
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/sarimax.py:866: UserWarning: Too few observations to estimate starting parameters for seasonal ARMA. All parameters except for variances will be set to zeros.
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/mlemodel.py:1234: RuntimeWarning: divide by zero encountered in divide
/usr/local/lib/python3.10/dist-packages/statsmodels/tsa/statespace/mlemodel.py:1234: RuntimeWarning: invalid value encountered in divide
/usr/local/lib/python3.10/dist-packages/statsmodels/base/model.py:607: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
(the same warnings repeat for each walk-forward refit; repeats omitted)
print(f"UNIT_SALES: MSE={mse_unit}, MAE={mae_unit}")
print(f"DOLLAR_SALES: MSE={mse_dollar}, MAE={mae_dollar}")
UNIT_SALES: MSE=419357.07149860164, MAE=521.6442435461662
DOLLAR_SALES: MSE=277010.1154416626, MAE=398.3749394864194
The MAE and MSE values for the model are 521 and 419,357 respectively for unit sales, and 398 and 277,010 for dollar sales.
Based on these results, this model also identifies the weeks from November through January as the best 13-week period.
Item Description: Diet Venomous Blast Energy Drink Kiwano 16 Liquid Small
Caloric Segment: Diet
Market Category: Energy
Manufacturer: Swire-CC
Brand: Venomous Blast
Package Type: 16 Liquid Small
Flavor: Kiwano
Which 13 weeks of the year would this product perform best in the market?
What is the forecasted demand, in weeks, for those 13 weeks?
We filter on the category 'Energy', manufacturer Swire-CC, brand 'Venomous Blast', and the Diet/Light caloric segment.
Before building the model to forecast the sales, let's check the data and find similar products in the given dataset. Since the dataset is large, we use Google BigQuery to filter it to the required combination and import the filtered data into this notebook.
The dataset provided contains no records with the package type '16 Liquid Small'. So we first consider the remaining attributes: Caloric Segment 'Diet/Light', Market Category 'Energy', Manufacturer 'Swire-CC', and Brand 'Venomous Blast'.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
# Running this code will display the query used to generate your previous job
job = client.get_job('bquxjob_42824de3_18e9731e74d') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE, SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand`
WHERE MANUFACTURER = 'SWIRE-CC'
  AND CALORIC_SEGMENT = 'DIET/LIGHT'
  AND CATEGORY = 'ENERGY'
  AND BRAND = 'VENOMOUS BLAST'
GROUP BY DATE;
# Running this code will read results from your previous job
job = client.get_job('bquxjob_42824de3_18e9731e74d') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| | DATE | UNIT_SALES | DOLLAR_SALES |
|---|---|---|---|
| 0 | 2022-08-20 | 3060.0 | 3343.27 |
| 1 | 2021-03-20 | 4584.0 | 4264.95 |
| 2 | 2021-10-30 | 3433.0 | 3102.08 |
| 3 | 2021-08-21 | 4333.0 | 4032.92 |
| 4 | 2021-03-27 | 3935.0 | 3694.53 |
| ... | ... | ... | ... |
| 134 | 2022-06-18 | 3231.0 | 3555.89 |
| 135 | 2021-07-10 | 3985.0 | 3689.87 |
| 136 | 2022-10-08 | 2869.0 | 3123.34 |
| 137 | 2022-09-24 | 2594.0 | 2842.74 |
| 138 | 2023-07-29 | 1767.0 | 1896.93 |
139 rows × 3 columns
We pull the data from Google BigQuery into this notebook. Next we modify the 'results' dataframe by converting the 'DATE' column to datetime format and deriving year, month, and week features from it.
import pandas as pd
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extracting relevant features for forecasting (copy to avoid SettingWithCopyWarning)
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| | DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR |
|---|---|---|---|---|---|---|
| 0 | 2022-08-20 | 3060.0 | 3343.27 | 2022 | 8 | 33 |
| 1 | 2021-03-20 | 4584.0 | 4264.95 | 2021 | 3 | 11 |
| 2 | 2021-10-30 | 3433.0 | 3102.08 | 2021 | 10 | 43 |
| 3 | 2021-08-21 | 4333.0 | 4032.92 | 2021 | 8 | 33 |
| 4 | 2021-03-27 | 3935.0 | 3694.53 | 2021 | 3 | 12 |
| ... | ... | ... | ... | ... | ... | ... |
| 134 | 2022-06-18 | 3231.0 | 3555.89 | 2022 | 6 | 24 |
| 135 | 2021-07-10 | 3985.0 | 3689.87 | 2021 | 7 | 27 |
| 136 | 2022-10-08 | 2869.0 | 3123.34 | 2022 | 10 | 40 |
| 137 | 2022-09-24 | 2594.0 | 2842.74 | 2022 | 9 | 38 |
| 138 | 2023-07-29 | 1767.0 | 1896.93 | 2023 | 7 | 30 |
139 rows × 6 columns
We follow this same pattern throughout the notebook: import the filtered dataset from Google BigQuery, then extract year, month, and week features from the 'DATE' column.
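As a quick sanity check on the week extraction, Python's standard library computes the same ISO week number that pandas' `dt.isocalendar().week` returns; for example, 2022-08-20 (the first row in the table above) falls in ISO week 33:

```python
from datetime import date

# ISO calendar triple: (ISO year, ISO week number, ISO weekday)
iso_year, iso_week, iso_weekday = date(2022, 8, 20).isocalendar()
print(iso_year, iso_week)  # → 2022 33, matching the WEEK_OF_YEAR column for that row
```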
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
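To illustrate the additive structure Prophet assumes, here is a minimal toy sketch (not Prophet itself, and with made-up coefficients) that builds a weekly series as trend plus yearly seasonality, the y(t) = g(t) + s(t) decomposition described above:

```python
import math

weeks = range(104)  # two years of weekly observations
g = [0.5 * t for t in weeks]                              # trend component g(t)
s = [10 * math.sin(2 * math.pi * t / 52) for t in weeks]  # yearly seasonal component s(t)
y = [gt + st for gt, st in zip(g, s)]                     # additive model: y(t) = g(t) + s(t)

# After exactly one seasonal cycle (52 weeks) the seasonal term returns to its
# starting value, so y[52] - y[0] is pure trend growth: 0.5 * 52 = 26
```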
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
# Converting the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)
# Preparing the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)
# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)
# Create a weekly future dataframe for one year and make predictions
# (weekly frequency so that a 13-row rolling window really spans 13 weeks)
future_unit = prophet_model_unit.make_future_dataframe(periods=52, freq='W')
future_dollar = prophet_model_dollar.make_future_dataframe(periods=52, freq='W')
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)
# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()
# Function to find the best 13 consecutive weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted rows after the last historical date (copy to avoid SettingWithCopyWarning)
    forecast_future = forecast[forecast['ds'] > last_historical_date].copy()
    # Rolling sum over 13 weekly rows; idxmax marks the END of the best window
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=13).sum()
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    best_period_end = forecast_future.loc[best_period_idx, 'ds']
    best_period_start = best_period_end - pd.DateOffset(weeks=12)  # window includes the end week
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)
# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
fig = model.plot(forecast)
plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
plt.title(title)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
In both forecasts, unit sales and dollar sales are highest in the weeks from November through January.
Let's evaluate the model's performance metrics.
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point].copy() # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy() # Make a copy to avoid modifying the original DataFrame
# Resetting index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)
train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)
# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']]) # Ensure 'ds' and 'y' columns are selected
# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
# Repeat the process for DOLLAR_SALES (fit on dollar sales, not the unit-sales 'y')
train_dollar = train[['ds', 'DOLLAR_SALES']].rename(columns={'DOLLAR_SALES': 'y'})
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train_dollar)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
# Printing the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 184.12940565230494, MSE: 53759.618137118676
DOLLAR_SALES - MAE: 226.95544492763685, MSE: 91284.82288173026
The MAE and MSE values for unit sales are 184 and 53,759. For dollar sales the corresponding values are 226 and 91,284.
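For reference, the two metrics are simple averages over the test set: MAE is the mean of the absolute forecast errors, while MSE averages the squared errors and so penalizes large misses more heavily. A minimal sketch with made-up numbers:

```python
def mae(actual, predicted):
    """Mean absolute error: average magnitude of the forecast errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: penalizes large errors more heavily than MAE."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [3060, 4584, 3433]     # hypothetical weekly unit sales
predicted = [3000, 4500, 3500]  # hypothetical forecasts
print(mae(actual, predicted))   # (60 + 84 + 67) / 3 ≈ 70.33
print(mse(actual, predicted))   # (3600 + 7056 + 4489) / 3 ≈ 5048.33
```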
Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.
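The "weighted averages" idea reduces to a one-line recursion in the simplest case: each smoothed value is s_t = α·y_t + (1 − α)·s_{t−1}, so recent observations receive geometrically more weight. A minimal sketch of that recursion (the Holt-Winters model fitted below adds trend and seasonal terms on top of it):

```python
def simple_exp_smoothing(series, alpha):
    """Simple exponential smoothing: s_t = alpha * y_t + (1 - alpha) * s_{t-1}."""
    smoothed = [series[0]]  # initialize the level with the first observation
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

# With alpha = 0.5 each step averages the new observation with the running level
print(simple_exp_smoothing([10, 12, 11], alpha=0.5))  # → [10, 11.0, 11.0]
```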
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Ensuring the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Sorting the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)
# Define the last date in the DataFrame
last_date = forecast_features.index.max()
# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W-SUN')[1:]
# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index
# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index
# Function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)  # 13 weeks include the end week
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)
# Plotting function
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    forecast = forecast.clip(lower=0)
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()
# Plot the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')
Here too, the best sales fall between November and January for both unit and dollar sales.
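The window search both models rely on can be verified on a toy series. One subtlety: a 13-step rolling sum is labeled at the window's last element, so the index of the maximum is the END of the best period and the start is 12 steps earlier. A pure-Python sketch:

```python
def best_13_week_window(values, window=13):
    """Return (start_idx, end_idx, total) of the contiguous window with the largest sum."""
    sums = [sum(values[i:i + window]) for i in range(len(values) - window + 1)]
    start = max(range(len(sums)), key=sums.__getitem__)
    end = start + window - 1  # where the pandas rolling-sum label would sit
    return start, end, sums[start]

# Strictly increasing toy forecasts: the best window must be the last 13 weeks
values = list(range(1, 21))  # weeks 1..20
print(best_13_week_window(values))  # → (7, 19, 182) since 8 + 9 + ... + 20 = 182
```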
# Define the function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()
# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)
# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")
exp_forecast.index.freq = 'W-SUN'
exp_forecast_dollar.index.freq = 'W-SUN'
# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]
# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)
print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)
# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()
print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)
print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2023-11-05 and end on 2024-01-28, with total sales: 27089.59739698417
Best 13 weeks for dollar sales start on 2023-11-05 and end on 2024-01-28, with total sales: 25716.84464484714

Best 13 weeks for Unit Sales:
2023-11-05    2238.295810
2023-11-12    2374.804756
2023-11-19    2197.131115
2023-11-26    2089.043340
2023-12-03    2363.403338
2023-12-10    2428.576655
2023-12-17    2156.636819
2023-12-24    2098.744240
2023-12-31    1857.491347
2024-01-07    2357.688862
2024-01-14    2025.345332
2024-01-21    1532.255441
2024-01-28    1370.180342
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2023-11-05    2237.709606
2023-11-12    2322.717063
2023-11-19    2262.854639
2023-11-26    2060.134314
2023-12-03    2302.053302
2023-12-10    2361.575892
2023-12-17    2001.013098
2023-12-24    1908.227347
2023-12-31    1753.311512
2024-01-07    2080.857676
2024-01-14    1856.926463
2024-01-21    1386.450884
2024-01-28    1183.012849
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')
The total forecasted sales over these 13 weeks are about 27,089 units and 25,716 dollars.
Let's evaluate the model performance.
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]
# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}
# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()
# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)
# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 239.7690087025342, MSE: 88172.0705520971
DOLLAR_SALES - MAE: 345.0714391443558, MSE: 184260.08837798057
The MAE and MSE values are 239 and 88,172 for unit sales, and 345 and 184,260 for dollar sales.
ARIMA stands for Autoregressive Integrated Moving Average. It's a popular and powerful time series forecasting technique used for modeling and predicting time series data. ARIMA models are particularly effective for stationary time series data, meaning the statistical properties of the series such as mean and variance are constant over time.
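The "Integrated" part of ARIMA is simply differencing: if a series has a steady trend, its mean drifts over time (non-stationary), but the period-over-period changes can be constant. A minimal sketch with a made-up trending series:

```python
series = [100, 103, 106, 109, 112, 115]  # steady upward trend: the mean drifts, so it's non-stationary
# First difference (d=1 in the ARIMA order): change from one period to the next
diffed = [b - a for a, b in zip(series, series[1:])]
print(diffed)  # → [3, 3, 3, 3, 3] — constant, hence stationary after one differencing
```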
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Ensure the DATE column is in datetime format and set as the DataFrame's index
# (skip if DATE was already promoted to the index in the previous section)
if 'DATE' in forecast_features.columns:
    forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
    forecast_features.set_index('DATE', inplace=True)
# Sort the DataFrame by the datetime index just to be sure
forecast_features.sort_index(inplace=True)
# Define the ARIMA model for UNIT_SALES
arima_model_unit = ARIMA(forecast_features['UNIT_SALES'], order=(1, 1, 52))
arima_result_unit = arima_model_unit.fit()
# Define the ARIMA model for DOLLAR_SALES
arima_model_dollar = ARIMA(forecast_features['DOLLAR_SALES'], order=(1, 1, 52))
arima_result_dollar = arima_model_dollar.fit()
# Define the date range for the next year after the last date in the dataset
last_date = forecast_features.index.max()
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
# Forecast the next 52 periods (assuming weekly data)
arima_forecast_unit = arima_result_unit.get_forecast(steps=52).predicted_mean
arima_forecast_dollar = arima_result_dollar.get_forecast(steps=52).predicted_mean
# Convert forecasts to pandas Series with a DateTimeIndex
arima_forecast_unit = pd.Series(arima_forecast_unit.values, index=forecast_dates)
arima_forecast_dollar = pd.Series(arima_forecast_dollar.values, index=forecast_dates)
# Check if rolling sum calculation is possible
rolling_sum = arima_forecast_unit.rolling(window=13, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
if pd.notnull(best_period_end):
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    # Plot ARIMA forecast with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['UNIT_SALES'], label='Actual Unit Sales', color='blue')
    plt.plot(arima_forecast_unit.index, arima_forecast_unit, label='ARIMA Forecast', color='red')
    plt.axvspan(best_period_start, best_period_end, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('ARIMA Forecast for Unit Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Unit Sales')
    plt.legend()
    plt.show()
else:
    print("No best period found")
According to the ARIMA model, the best 13 weeks for unit sales also run from November to January.
import pandas as pd
import matplotlib.pyplot as plt
# Define the ARIMA model for DOLLAR_SALES
arima_model_dollar = ARIMA(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
arima_result_dollar = arima_model_dollar.fit()
# Forecast the next 52 periods (assuming weekly data)
arima_forecast_dollar = arima_result_dollar.get_forecast(steps=52).predicted_mean
# Convert forecast to pandas Series with a DateTimeIndex
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
arima_forecast_dollar = pd.Series(arima_forecast_dollar.values, index=forecast_dates)
# Calculate the rolling sum over 13-week periods to find the best period for dollar sales
rolling_sum_dollar = arima_forecast_dollar.rolling(window=13, min_periods=1).sum()
best_period_end_dollar = rolling_sum_dollar.idxmax()
if pd.notnull(best_period_end_dollar):
    best_period_start_dollar = best_period_end_dollar - pd.DateOffset(weeks=12)
    # Plot the ARIMA forecast for DOLLAR_SALES with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['DOLLAR_SALES'], label='Actual Dollar Sales', color='blue')
    plt.plot(arima_forecast_dollar.index, arima_forecast_dollar, label='ARIMA Forecast', color='red')
    plt.axvspan(best_period_start_dollar, best_period_end_dollar, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('ARIMA Forecast for Dollar Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Dollar Sales')
    plt.legend()
    plt.show()
else:
    print("No best period found")
According to the ARIMA model, the best 13 weeks for dollar sales also run from November to January.
import pandas as pd
# Function to calculate the best 13-week period for any given forecast series
def calculate_best_13_weeks(forecast_series):
    rolling_sum = forecast_series.rolling(window=13, min_periods=1).sum()
    max_sum_index = rolling_sum.idxmax()
    max_sum_value = rolling_sum.max()
    start_of_best_period = max_sum_index - pd.DateOffset(weeks=12)  # 13 weeks including the end week
    return start_of_best_period, max_sum_index, max_sum_value
# Calculate for Unit Sales
best_start_unit, best_end_unit, best_sales_unit = calculate_best_13_weeks(arima_forecast_unit)
print(f"Best 13 Weeks for Unit Sales: {best_start_unit.date()} to {best_end_unit.date()}, Total Sales: {best_sales_unit}")
# Calculate for Dollar Sales
best_start_dollar, best_end_dollar, best_sales_dollar = calculate_best_13_weeks(arima_forecast_dollar)
print(f"Best 13 Weeks for Dollar Sales: {best_start_dollar.date()} to {best_end_dollar.date()}, Total Sales: {best_sales_dollar}")
Best 13 Weeks for Unit Sales: 2023-10-29 to 2024-01-21, Total Sales: 24357.8833125924
Best 13 Weeks for Dollar Sales: 2023-10-29 to 2024-01-21, Total Sales: 27014.166682422216
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
# Assuming forecast_features is your dataframe with a datetime index and UNIT_SALES and DOLLAR_SALES columns
data_unit_sales = forecast_features['UNIT_SALES']
data_dollar_sales = forecast_features['DOLLAR_SALES']
# Number of observations to leave out in each split for testing
n_splits = 5
# The order and seasonal order for ARIMA/SARIMA model
order = (1, 1, 1)
seasonal_order = (1, 1, 1, 52)
# Perform rolling forecast origin for unit sales
def rolling_forecast_origin(time_series, order, seasonal_order, n_splits):
    history = time_series.iloc[:-n_splits].tolist()
    predictions = []
    test_set = time_series.iloc[-n_splits:].tolist()
    for t in range(n_splits):
        model = ARIMA(history, order=order, seasonal_order=seasonal_order)
        model_fit = model.fit()
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        history.append(test_set[t])
    mae = mean_absolute_error(test_set, predictions)
    mse = mean_squared_error(test_set, predictions)
    return predictions, mae, mse
# Perform rolling forecast for UNIT_SALES
predictions_unit, mae_unit, mse_unit = rolling_forecast_origin(data_unit_sales, order, seasonal_order, n_splits)
# Perform rolling forecast for DOLLAR_SALES
predictions_dollar, mae_dollar, mse_dollar = rolling_forecast_origin(data_dollar_sales, order, seasonal_order, n_splits)
# Print the evaluation
print(f'ARIMA model MAE for UNIT_SALES: {mae_unit}')
print(f'ARIMA model MAE for DOLLAR_SALES: {mae_dollar}')
print(f'ARIMA model MSE for UNIT_SALES: {mse_unit}')
print(f'ARIMA model MSE for DOLLAR_SALES: {mse_dollar}')
ARIMA model MAE for UNIT_SALES: 159.96594909611218
ARIMA model MAE for DOLLAR_SALES: 154.39527519479262
ARIMA model MSE for UNIT_SALES: 41398.72850037805
ARIMA model MSE for DOLLAR_SALES: 41749.16367557272
The MAE values have decreased compared to the other models, so this is a fairly good model.
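The evaluation scheme used above, rolling forecast origin (also called walk-forward validation), refits at each step and only ever predicts one step ahead of the data it has seen. Its mechanics can be sketched with a stand-in naive forecaster (repeat the last observed value) in place of the ARIMA fit:

```python
def rolling_origin_eval(series, n_splits):
    """Walk-forward evaluation: forecast one step, then fold the true value into history."""
    history = list(series[:-n_splits])
    test_set = list(series[-n_splits:])
    predictions = []
    for actual in test_set:
        predictions.append(history[-1])  # naive 'last value' forecast stands in for ARIMA
        history.append(actual)           # the forecast origin rolls forward by one observation
    mae = sum(abs(a - p) for a, p in zip(test_set, predictions)) / n_splits
    return predictions, mae

preds, mae = rolling_origin_eval([1, 2, 3, 4, 5], n_splits=2)
print(preds, mae)  # → [3, 4] 1.0
```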
Across all the models used, the best 13 weeks are November to January for both unit and dollar sales. The best window starts on 2023-11-05 and ends on 2024-01-28, with forecasted totals of 27,089 units and 25,716 dollars.
Since we don't have the flavor 'Kiwano' in combination with the brand 'Venomous Blast', we now use the sales of the flavor without the brand filter.
We will now filter with the flavor 'Kiwano', manufacturer Swire-CC, category 'Energy', and the Diet/Light caloric segment.
Before building the model to forecast the sales, let's check the data and find similar products in the given dataset. Since the dataset is large, we use Google BigQuery to filter it to the required combination and import the filtered data into this notebook.
The dataset provided contains no records with the package type '16 Liquid Small'. So we first consider the remaining attributes: Caloric Segment 'Diet/Light', Market Category 'Energy', Manufacturer 'Swire-CC', and Flavor 'Kiwano'.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
# Running this code will display the query used to generate your previous job
job = client.get_job('bquxjob_47bb0cd1_18e9734ee35') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE, SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand`
WHERE ITEM LIKE '%KIWANO%'
  AND MANUFACTURER = 'SWIRE-CC'
  AND CALORIC_SEGMENT = 'DIET/LIGHT'
  AND CATEGORY = 'ENERGY'
GROUP BY DATE;
# Running this code will read results from your previous job
job = client.get_job('bquxjob_47bb0cd1_18e9734ee35') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| | DATE | UNIT_SALES | DOLLAR_SALES |
|---|---|---|---|
| 0 | 2022-07-09 | 559.0 | 592.38 |
| 1 | 2023-02-11 | 620.0 | 637.37 |
| 2 | 2023-03-11 | 453.0 | 482.55 |
| 3 | 2023-02-04 | 399.0 | 413.73 |
| 4 | 2023-10-28 | 413.0 | 422.02 |
| ... | ... | ... | ... |
| 134 | 2021-09-11 | 805.0 | 703.29 |
| 135 | 2021-05-29 | 635.0 | 575.11 |
| 136 | 2023-04-08 | 433.0 | 468.86 |
| 137 | 2021-10-30 | 612.0 | 544.15 |
| 138 | 2021-02-13 | 568.0 | 520.90 |
139 rows × 3 columns
We pull the data from Google BigQuery into this notebook. Next we modify the 'results' dataframe by converting the 'DATE' column to datetime format and deriving year, month, and week features from it.
import pandas as pd
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extract relevant features for forecasting (copy to avoid SettingWithCopyWarning when adding columns)
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| | DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR |
|---|---|---|---|---|---|---|
| 0 | 2022-07-09 | 559.0 | 592.38 | 2022 | 7 | 27 |
| 1 | 2023-02-11 | 620.0 | 637.37 | 2023 | 2 | 6 |
| 2 | 2023-03-11 | 453.0 | 482.55 | 2023 | 3 | 10 |
| 3 | 2023-02-04 | 399.0 | 413.73 | 2023 | 2 | 5 |
| 4 | 2023-10-28 | 413.0 | 422.02 | 2023 | 10 | 43 |
| ... | ... | ... | ... | ... | ... | ... |
| 134 | 2021-09-11 | 805.0 | 703.29 | 2021 | 9 | 36 |
| 135 | 2021-05-29 | 635.0 | 575.11 | 2021 | 5 | 21 |
| 136 | 2023-04-08 | 433.0 | 468.86 | 2023 | 4 | 14 |
| 137 | 2021-10-30 | 612.0 | 544.15 | 2021 | 10 | 43 |
| 138 | 2021-02-13 | 568.0 | 520.90 | 2021 | 2 | 6 |
139 rows × 6 columns
We follow this same pattern throughout the notebook: import the filtered dataset from Google BigQuery, then extract year, month, and week features from the 'DATE' column.
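Since this import-and-prepare step repeats for every product combination, it can be factored into a small helper. This is a sketch under the assumption that each query returns `DATE`, `UNIT_SALES`, and `DOLLAR_SALES` columns; the function name `prepare_forecast_features` is ours, not part of the original notebook:

```python
import pandas as pd

def prepare_forecast_features(results: pd.DataFrame) -> pd.DataFrame:
    """Convert DATE to datetime and add year/month/week features."""
    features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
    features['DATE'] = pd.to_datetime(features['DATE'])
    features['YEAR'] = features['DATE'].dt.year
    features['MONTH'] = features['DATE'].dt.month
    features['WEEK_OF_YEAR'] = features['DATE'].dt.isocalendar().week
    return features
```

Each section below could then call `prepare_forecast_features(results)` instead of repeating the same five lines.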
Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Ensuring the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Sort the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)
# Define the last date in the DataFrame
last_date = forecast_features.index.max()
# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W-SUN')[1:]
# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
forecast_features['UNIT_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index
# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
forecast_features['DOLLAR_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index
# Function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)  # 13 weeks include the end week
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)
# Plotting function
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    forecast = forecast.clip(lower=0)  # Ensure no negative values in the forecast
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()
# Plot the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')
Here the best sales run from November to January for both unit and dollar sales.
# Define the function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()
# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)
# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")
# Since 'forecast_index' doesn't have the frequency set, let's define it to ensure we can perform the rolling operation.
exp_forecast.index.freq = 'W-SUN'
exp_forecast_dollar.index.freq = 'W-SUN'
# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]
# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)
print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)
# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()
print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)
print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2023-11-05 and end on 2024-01-28, with total sales: 8614.014741194007
Best 13 weeks for dollar sales start on 2023-11-05 and end on 2024-01-28, with total sales: 8415.144403697363

Best 13 weeks for Unit Sales:
2023-11-05    542.921586
2023-11-12    633.432077
2023-11-19    614.065669
2023-11-26    572.752860
2023-12-03    733.184863
2023-12-10    769.885958
2023-12-17    651.010705
2023-12-24    738.873109
2023-12-31    706.283460
2024-01-07    800.169949
2024-01-14    799.041546
2024-01-21    496.021773
2024-01-28    556.371186
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2023-11-05    556.013072
2023-11-12    613.811228
2023-11-19    613.488906
2023-11-26    573.750007
2023-12-03    713.214290
2023-12-10    753.815524
2023-12-17    645.691097
2023-12-24    701.922308
2023-12-31    682.957786
2024-01-07    774.060050
2024-01-14    755.137250
2024-01-21    488.390387
2024-01-28    542.892500
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')
Over these 13 weeks, the forecast totals about 8,614 units and about $8,415 in dollar sales.
Let's evaluate the model performance.
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Split the data into train and test sets (80/20 split)
split_point = int(len(forecast_features) * 0.8)
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]
# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}
# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()
# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)
# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 78.47035335943062, MSE: 8222.930070753187
DOLLAR_SALES - MAE: 88.19229715697364, MSE: 10195.112930177784
The MAE and MSE are about 78 and 8,223 for unit sales, and about 88 and 10,195 for dollar sales.
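To put these errors in context, a scale-free metric such as MAPE can be computed on the same test split. The helper below is our own sketch, demonstrated on synthetic numbers; in the notebook you would call it as `mape(test['UNIT_SALES'], unit_sales_forecast)`:

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, ignoring zero-valued actuals."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = actual != 0
    return np.mean(np.abs((actual[mask] - predicted[mask]) / actual[mask])) * 100

# Synthetic demonstration values
print(f"MAPE: {mape([500, 600, 700], [550, 570, 760]):.1f}%")  # → MAPE: 7.9%
```

For reference, an MAE of roughly 78 units against weekly actuals that mostly fall in the 400-800 range corresponds to an error on the order of 10-15%.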
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
# Restore DATE as a column (it was set as the index for exponential smoothing) and sort by date
forecast_features = forecast_features.reset_index()
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)
# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)
# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)
# Create a weekly future dataframe for one year and make predictions
# (the historical data is weekly, so forecasting 52 weekly steps keeps the
# 13-row rolling window below equal to 13 weeks)
future_unit = prophet_model_unit.make_future_dataframe(periods=52, freq='W')
future_dollar = prophet_model_dollar.make_future_dataframe(periods=52, freq='W')
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)
# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()
# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date
    # (copy so adding a column does not trigger SettingWithCopyWarning)
    forecast_future = forecast[forecast['ds'] > last_historical_date].copy()
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    # The rolling sum is labelled at the window's end, so the start is 12 weeks earlier
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    best_period_end = forecast_future.loc[best_period_idx, 'ds']
    best_period_start = best_period_end - pd.DateOffset(weeks=12)  # 13 weeks include the end week
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)
# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()
# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
In the weeks from October through December, both unit sales and dollar sales are at their highest.
Let's evaluate the model's performance metrics.
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point].copy() # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy() # Make a copy to avoid modifying the original DataFrame
# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)
# Prepare separate frames for each target following Prophet's ds/y convention
train_unit = train[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
train_dollar = train[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})

# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train_unit)

# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)

# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'].iloc[-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'].iloc[-len(test):])

# Repeat the process for DOLLAR_SALES, fitting on the dollar-sales series
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train_dollar)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'].iloc[-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'].iloc[-len(test):])
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 51.40720349079181, MSE: 4829.472102122382
DOLLAR_SALES - MAE: 62.38062539059109, MSE: 7005.025188863927
The MAE and MSE for unit sales are about 51 and 4,829; for dollar sales the corresponding values are about 62 and 7,005.
ARIMA stands for Autoregressive Integrated Moving Average. It's a popular and powerful time series forecasting technique used for modeling and predicting time series data. ARIMA models are particularly effective for stationary time series data, meaning the statistical properties of the series such as mean and variance are constant over time.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Sort the DataFrame by the datetime index just to be sure
forecast_features.sort_index(inplace=True)
# Define the ARIMA model for UNIT_SALES
arima_model_unit = ARIMA(forecast_features['UNIT_SALES'], order=(1, 1, 52))
arima_result_unit = arima_model_unit.fit()
# Define the ARIMA model for DOLLAR_SALES
arima_model_dollar = ARIMA(forecast_features['DOLLAR_SALES'], order=(1, 1, 52))
arima_result_dollar = arima_model_dollar.fit()
# Define the date range for the next year after the last date in the dataset
last_date = forecast_features.index.max()
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
# Forecast the next 52 periods (assuming weekly data)
arima_forecast_unit = arima_result_unit.get_forecast(steps=52).predicted_mean
arima_forecast_dollar = arima_result_dollar.get_forecast(steps=52).predicted_mean
# Convert forecasts to pandas Series with a DateTimeIndex
arima_forecast_unit = pd.Series(arima_forecast_unit.values, index=forecast_dates)
arima_forecast_dollar = pd.Series(arima_forecast_dollar.values, index=forecast_dates)
# Check if rolling sum calculation is possible
rolling_sum = arima_forecast_unit.rolling(window=13, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
if pd.notnull(best_period_end):
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    # Plot ARIMA forecast with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['UNIT_SALES'], label='Actual Unit Sales', color='blue')
    plt.plot(arima_forecast_unit.index, arima_forecast_unit, label='ARIMA Forecast', color='red')
    plt.axvspan(best_period_start, best_period_end, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('ARIMA Forecast for Unit Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Unit Sales')
    plt.legend()
    plt.show()
else:
    print("No best period found")
In the ARIMA model, the best 13 weeks for unit sales run from November to January.
import pandas as pd
import matplotlib.pyplot as plt
# Define the ARIMA model for DOLLAR_SALES
arima_model_dollar = ARIMA(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
arima_result_dollar = arima_model_dollar.fit()
# Forecast the next 52 periods (assuming weekly data)
arima_forecast_dollar = arima_result_dollar.get_forecast(steps=52).predicted_mean
# Convert forecast to pandas Series with a DateTimeIndex
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
arima_forecast_dollar = pd.Series(arima_forecast_dollar.values, index=forecast_dates)
# Calculate the rolling sum over 13-week periods to find the best period for dollar sales
rolling_sum_dollar = arima_forecast_dollar.rolling(window=13, min_periods=1).sum()
best_period_end_dollar = rolling_sum_dollar.idxmax()
if pd.notnull(best_period_end_dollar):
    best_period_start_dollar = best_period_end_dollar - pd.DateOffset(weeks=12)
    # Plot the ARIMA forecast for DOLLAR_SALES with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['DOLLAR_SALES'], label='Actual Dollar Sales', color='blue')
    plt.plot(arima_forecast_dollar.index, arima_forecast_dollar, label='ARIMA Forecast', color='red')
    plt.axvspan(best_period_start_dollar, best_period_end_dollar, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('ARIMA Forecast for Dollar Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Dollar Sales')
    plt.legend()
    plt.show()
else:
    print("No best period found")
In the ARIMA model, the best 13 weeks for dollar sales likewise run from November to January.
import pandas as pd
# Function to calculate the best 13-week period for any given forecast series
def calculate_best_13_weeks(forecast_series):
    rolling_sum = forecast_series.rolling(window=13, min_periods=1).sum()
    max_sum_index = rolling_sum.idxmax()
    max_sum_value = rolling_sum.max()
    start_of_best_period = max_sum_index - pd.DateOffset(weeks=12)  # 13 weeks including the end week
    return start_of_best_period, max_sum_index, max_sum_value
# Calculate for Unit Sales
best_start_unit, best_end_unit, best_sales_unit = calculate_best_13_weeks(arima_forecast_unit)
print(f"Best 13 Weeks for Unit Sales: {best_start_unit.date()} to {best_end_unit.date()}, Total Sales: {best_sales_unit}")
# Calculate for Dollar Sales
best_start_dollar, best_end_dollar, best_sales_dollar = calculate_best_13_weeks(arima_forecast_dollar)
print(f"Best 13 Weeks for Dollar Sales: {best_start_dollar.date()} to {best_end_dollar.date()}, Total Sales: {best_sales_dollar}")
Best 13 Weeks for Unit Sales: 2023-11-05 to 2024-01-28, Total Sales: 6254.373421893271
Best 13 Weeks for Dollar Sales: 2023-10-29 to 2024-01-21, Total Sales: 8603.569286046462
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
# Assuming forecast_features is your dataframe with a datetime index and UNIT_SALES and DOLLAR_SALES columns
data_unit_sales = forecast_features['UNIT_SALES']
data_dollar_sales = forecast_features['DOLLAR_SALES']
# Number of observations to leave out in each split for testing
n_splits = 5
# The order and seasonal order for ARIMA/SARIMA model
order = (1, 1, 1)
seasonal_order = (1, 1, 1, 52)
# Perform rolling forecast origin for unit sales
def rolling_forecast_origin(time_series, order, seasonal_order, n_splits):
    history = time_series.iloc[:-n_splits].tolist()
    predictions = []
    test_set = time_series.iloc[-n_splits:].tolist()
    for t in range(n_splits):
        model = ARIMA(history, order=order, seasonal_order=seasonal_order)
        model_fit = model.fit()
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        history.append(test_set[t])
    mae = mean_absolute_error(test_set, predictions)
    mse = mean_squared_error(test_set, predictions)
    return predictions, mae, mse
# Perform rolling forecast for UNIT_SALES
predictions_unit, mae_unit, mse_unit = rolling_forecast_origin(data_unit_sales, order, seasonal_order, n_splits)
# Perform rolling forecast for DOLLAR_SALES
predictions_dollar, mae_dollar, mse_dollar = rolling_forecast_origin(data_dollar_sales, order, seasonal_order, n_splits)
# Print the evaluation
print(f'ARIMA model MAE for UNIT_SALES: {mae_unit}')
print(f'ARIMA model MAE for DOLLAR_SALES: {mae_dollar}')
print(f'ARIMA model MSE for UNIT_SALES: {mse_unit}')
print(f'ARIMA model MSE for DOLLAR_SALES: {mse_dollar}')
ARIMA model MAE for UNIT_SALES: 60.95541428597327
ARIMA model MAE for DOLLAR_SALES: 66.25876265212688
ARIMA model MSE for UNIT_SALES: 3930.1638823873313
ARIMA model MSE for DOLLAR_SALES: 5237.383369388842
Compared to the other models, ARIMA achieves the lowest MSE values on both targets, though Prophet's MAE is slightly lower; note also that ARIMA was evaluated with a 5-step rolling-origin forecast rather than the 80/20 split, so the metrics are not strictly comparable.
Taking ARIMA as the best model on the basis of its low MSE, the best 13 weeks for unit sales run from 2023-11-05 to 2024-01-28 with total forecast sales of about 6,254 units, and the best 13 weeks for dollar sales run from 2023-10-29 to 2024-01-21 with total dollar sales of about $8,603.
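Collecting the printed metrics into one table makes the comparison explicit. The values below are transcribed (rounded) from the evaluation cells above:

```python
import pandas as pd

# Test-set metrics reported above for each model (unit sales / dollar sales)
comparison = pd.DataFrame({
    'Model': ['Exponential Smoothing', 'Prophet', 'ARIMA (rolling origin)'],
    'MAE_unit': [78.47, 51.41, 60.96],
    'MSE_unit': [8222.93, 4829.47, 3930.16],
    'MAE_dollar': [88.19, 62.38, 66.26],
    'MSE_dollar': [10195.11, 7005.03, 5237.38],
})
print(comparison.to_string(index=False))
# ARIMA has the lowest MSE on both targets; Prophet has the lowest MAE.
```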
Next, we analyze the sales of non-Swire manufacturers instead of Swire.
We now filter on flavor 'Kiwano' with a non-Swire manufacturer, category 'Energy', and the 'Diet/Light' caloric segment.
As before, we use Google BigQuery to filter the large dataset and import the filtered data into this notebook before building the forecasting model.
Since no combinations with Package Type '16 liquid small' exist in the data, we again consider the other attributes: Caloric Segment: Diet/Light, Market Category: Energy, Manufacturer: not Swire-CC, and Flavor: Kiwano.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table

# Authenticate before creating the BigQuery client
auth.authenticate_user()

project = 'spring-swire-ca'  # Project ID inserted based on the query results selected to explore
location = 'US'  # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
job = client.get_job('bquxjob_406e83f8_18e97393a1a') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES FROM `swirecc.fact_market_demand` WHERE ITEM LIKE '%KIWANO%' AND MANUFACTURER != 'SWIRE-CC' AND CALORIC_SEGMENT = 'DIET/LIGHT' AND CATEGORY = 'ENERGY' GROUP BY DATE;
job = client.get_job('bquxjob_406e83f8_18e97393a1a') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| | DATE | UNIT_SALES | DOLLAR_SALES |
|---|---|---|---|
| 0 | 2021-09-11 | 123092.00 | 257065.41 |
| 1 | 2022-01-08 | 103403.00 | 221574.93 |
| 2 | 2022-08-06 | 111003.00 | 256998.64 |
| 3 | 2023-02-25 | 77950.00 | 189635.07 |
| 4 | 2022-08-20 | 104504.00 | 256971.18 |
| ... | ... | ... | ... |
| 134 | 2023-02-18 | 81871.00 | 201203.89 |
| 135 | 2022-06-25 | 110180.00 | 254196.28 |
| 136 | 2021-10-23 | 120394.00 | 242164.06 |
| 137 | 2022-07-09 | 122341.00 | 259899.63 |
| 138 | 2023-10-28 | 71646.85 | 164146.62 |
139 rows × 3 columns
With the data imported from Google BigQuery, we again convert the 'DATE' column to datetime and derive year, month, and week features from it.
import pandas as pd
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extract relevant features for forecasting (copy to avoid SettingWithCopyWarning when adding columns)
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| | DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR |
|---|---|---|---|---|---|---|
| 0 | 2021-09-11 | 123092.00 | 257065.41 | 2021 | 9 | 36 |
| 1 | 2022-01-08 | 103403.00 | 221574.93 | 2022 | 1 | 1 |
| 2 | 2022-08-06 | 111003.00 | 256998.64 | 2022 | 8 | 31 |
| 3 | 2023-02-25 | 77950.00 | 189635.07 | 2023 | 2 | 8 |
| 4 | 2022-08-20 | 104504.00 | 256971.18 | 2022 | 8 | 33 |
| ... | ... | ... | ... | ... | ... | ... |
| 134 | 2023-02-18 | 81871.00 | 201203.89 | 2023 | 2 | 7 |
| 135 | 2022-06-25 | 110180.00 | 254196.28 | 2022 | 6 | 25 |
| 136 | 2021-10-23 | 120394.00 | 242164.06 | 2021 | 10 | 42 |
| 137 | 2022-07-09 | 122341.00 | 259899.63 | 2022 | 7 | 27 |
| 138 | 2023-10-28 | 71646.85 | 164146.62 | 2023 | 10 | 43 |
139 rows × 6 columns
We follow this same pattern throughout the notebook: import the filtered dataset from Google BigQuery, then extract year, month, and week features from the 'DATE' column.
Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Sort the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)
# Define the last date in the DataFrame
last_date = forecast_features.index.max()
# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W-SUN')[1:]
# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
forecast_features['UNIT_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index
# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
forecast_features['DOLLAR_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index
# Function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)  # 13 weeks include the end week
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)
# Plotting function
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    forecast = forecast.clip(lower=0)  # Ensure no negative values in the forecast
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()
# Plot the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')
Here the best sales fall between November and January for both dollar and unit sales.
# Defining the function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()
# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)
# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")
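The rolling-sum trick used in `find_best_13_weeks` can be checked on a toy weekly series (synthetic numbers, chosen only to make the peak obvious): the row where the 13-week rolling sum peaks is the *last* week of the best window, so the start is 12 weeks earlier.

```python
import pandas as pd

# 20 weeks of synthetic sales with a clear 13-week peak at the end
idx = pd.date_range('2023-01-01', periods=20, freq='W-SUN')
sales = pd.Series([1] * 7 + [5] * 13, index=idx)

rolling = sales.rolling(window=13, min_periods=1).sum()
end = rolling.idxmax()                 # last week of the best window
start = end - pd.DateOffset(weeks=12)  # 13 weeks including the end week
print(start.date(), end.date(), rolling.max())
```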
# Since 'forecast_index' doesn't have the frequency set, let's define it to ensure we can perform the rolling operation.
exp_forecast.index.freq = 'W-SUN'
exp_forecast_dollar.index.freq = 'W-SUN'
# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]
# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)
print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)
# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()
print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)
print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2023-11-05 and end on 2024-01-28, with total sales: 640494.8662601131
Best 13 weeks for dollar sales start on 2023-11-05 and end on 2024-01-28, with total sales: 1313210.6307216804
Best 13 weeks for Unit Sales:
2023-11-05    69740.235538
2023-11-12    66293.906130
2023-11-19    52157.082579
2023-11-26    43255.367610
2023-12-03    45781.059616
2023-12-10    49516.423206
2023-12-17    51097.188408
2023-12-24    50896.884348
2023-12-31    47854.218585
2024-01-07    48371.876137
2024-01-14    36942.797598
2024-01-21    40319.882426
2024-01-28    38267.944079
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2023-11-05    157196.628812
2023-11-12    149462.811153
2023-11-19    116709.370115
2023-11-26    98405.742345
2023-12-03    96340.505357
2023-12-10    101052.849146
2023-12-17    106224.644560
2023-12-24    100832.364426
2023-12-31    92506.248174
2024-01-07    90096.539788
2024-01-14    72240.792266
2024-01-21    69553.290389
2024-01-28    62588.844191
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['November', 'December', 'January'], dtype='object')
The totals over these 13 weeks are about 640,494 units and $1,313,210 in dollar sales.
Let's evaluate the model performance.
# Splitting the data into train and test sets
from sklearn.metrics import mean_absolute_error, mean_squared_error

split_point = int(len(forecast_features) * 0.8)  # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]
# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}
# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()
# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)
# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 22372.4380005534, MSE: 672118208.9165068
DOLLAR_SALES - MAE: 68485.10466434849, MSE: 5900371668.366554
The MAE and MSE values are 68,485 and 5,900,371,668 for dollar sales, and 22,372 and 672,118,208 for unit sales.
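For reference, here is how the two metrics behave on a toy example (illustrative numbers only): MAE averages the absolute errors, while MSE squares the errors first, so a single large miss dominates MSE far more than it does MAE.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = [100, 200, 300]
predicted = [110, 190, 250]  # absolute errors: 10, 10, 50

print(mean_absolute_error(actual, predicted))  # (10 + 10 + 50) / 3
print(mean_squared_error(actual, predicted))   # (100 + 100 + 2500) / 3
```

This is why the MSE figures above look so large relative to MAE: a few badly missed weeks are squared before averaging.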
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
# Convert the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)
# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)
# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)
# Create a weekly future dataframe for one year (weekly frequency to match the weekly sales data) and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=52, freq='W')
future_dollar = prophet_model_dollar.make_future_dataframe(periods=52, freq='W')
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)
# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()
# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date;
    # .copy() avoids pandas' SettingWithCopyWarning when adding the rolling column
    forecast_future = forecast[forecast['ds'] > last_historical_date].copy()
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    # The rolling sum at a row covers that row and the 12 rows before it,
    # so the idxmax row is the END of the best window
    best_period_end = forecast_future.loc[best_period_idx, 'ds']
    best_period_start = best_period_end - pd.DateOffset(weeks=12)  # 13 weeks include the end week
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)
# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()
# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
The best 13 weeks, in which both unit sales and dollar sales are highest, fall between November and January.
Let's evaluate the model performance metrics.
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point].copy() # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy() # Make a copy to avoid modifying the original DataFrame
# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)
train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)
# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']]) # Ensure 'ds' and 'y' columns are selected
# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test), freq='W')  # weekly frequency to match the data
forecast_unit = prophet_model_unit.predict(future_unit)
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
# Repeat the process for DOLLAR_SALES, renaming that column to 'y' for Prophet
# (fitting on train[['ds', 'y']] again would reuse the unit-sales target)
train_dollar = train[['ds', 'DOLLAR_SALES']].rename(columns={'DOLLAR_SALES': 'y'})
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train_dollar)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test), freq='W')  # weekly frequency to match the data
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 3904.5456265075886, MSE: 22270963.418062758
DOLLAR_SALES - MAE: 107859.06277483127, MSE: 11735122700.572842
The MAE and MSE values for unit sales are 3,904 and 22,270,963. For dollar sales, the respective values are 107,859 and 11,735,122,700.
ARIMA stands for Autoregressive Integrated Moving Average. It's a popular and powerful time series forecasting technique used for modeling and predicting time series data. ARIMA models are particularly effective for stationary time series data, meaning the statistical properties of the series such as mean and variance are constant over time.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Sort the DataFrame by the datetime index just to be sure
forecast_features.sort_index(inplace=True)
# Define the ARIMA model for UNIT_SALES
arima_model_unit = ARIMA(forecast_features['UNIT_SALES'], order=(1, 1, 52))
arima_result_unit = arima_model_unit.fit()
# Define the ARIMA model for DOLLAR_SALES
arima_model_dollar = ARIMA(forecast_features['DOLLAR_SALES'], order=(1, 1, 52))
arima_result_dollar = arima_model_dollar.fit()
# Define the date range for the next year after the last date in the dataset
last_date = forecast_features.index.max()
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
# Forecast the next 52 periods (assuming weekly data)
arima_forecast_unit = arima_result_unit.get_forecast(steps=52).predicted_mean
arima_forecast_dollar = arima_result_dollar.get_forecast(steps=52).predicted_mean
# Convert forecasts to pandas Series with a DateTimeIndex
arima_forecast_unit = pd.Series(arima_forecast_unit.values, index=forecast_dates)
arima_forecast_dollar = pd.Series(arima_forecast_dollar.values, index=forecast_dates)
# Check if rolling sum calculation is possible
rolling_sum = arima_forecast_unit.rolling(window=13, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
if pd.notnull(best_period_end):
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    # Plot ARIMA forecast with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['UNIT_SALES'], label='Actual Unit Sales', color='blue')
    plt.plot(arima_forecast_unit.index, arima_forecast_unit, label='ARIMA Forecast', color='red')
    plt.axvspan(best_period_start, best_period_end, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('ARIMA Forecast for Unit Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Unit Sales')
    plt.legend()
    plt.show()
else:
    print("No best period found")
The best 13 weeks, in which unit sales are highest, fall between November and January.
import pandas as pd
import matplotlib.pyplot as plt
# Define the ARIMA model for DOLLAR_SALES
arima_model_dollar = ARIMA(forecast_features['DOLLAR_SALES'], order=(1, 1, 1), seasonal_order=(1, 1, 1, 52))
arima_result_dollar = arima_model_dollar.fit()
# Forecast the next 52 periods (assuming weekly data)
arima_forecast_dollar = arima_result_dollar.get_forecast(steps=52).predicted_mean
# Convert forecast to pandas Series with a DateTimeIndex
forecast_dates = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
arima_forecast_dollar = pd.Series(arima_forecast_dollar.values, index=forecast_dates)
# Calculate the rolling sum over 13-week periods to find the best period for dollar sales
rolling_sum_dollar = arima_forecast_dollar.rolling(window=13, min_periods=1).sum()
best_period_end_dollar = rolling_sum_dollar.idxmax()
if pd.notnull(best_period_end_dollar):
    best_period_start_dollar = best_period_end_dollar - pd.DateOffset(weeks=12)
    # Plot the ARIMA forecast for DOLLAR_SALES with the best 13 weeks highlighted
    plt.figure(figsize=(10, 6))
    plt.plot(forecast_features.index, forecast_features['DOLLAR_SALES'], label='Actual Dollar Sales', color='blue')
    plt.plot(arima_forecast_dollar.index, arima_forecast_dollar, label='ARIMA Forecast', color='red')
    plt.axvspan(best_period_start_dollar, best_period_end_dollar, color='yellow', alpha=0.3, label='Best 13 Weeks')
    plt.title('ARIMA Forecast for Dollar Sales with Best 13 Weeks Highlighted')
    plt.xlabel('Date')
    plt.ylabel('Dollar Sales')
    plt.legend()
    plt.show()
else:
    print("No best period found")
The best 13 weeks, in which dollar sales are highest, fall between November and January.
import pandas as pd
# Function to calculate the best 13-week period for any given forecast series
def calculate_best_13_weeks(forecast_series):
    rolling_sum = forecast_series.rolling(window=13, min_periods=1).sum()
    max_sum_index = rolling_sum.idxmax()
    max_sum_value = rolling_sum.max()
    start_of_best_period = max_sum_index - pd.DateOffset(weeks=12)  # 13 weeks including the end week
    return start_of_best_period, max_sum_index, max_sum_value
# Calculate for Unit Sales
best_start_unit, best_end_unit, best_sales_unit = calculate_best_13_weeks(arima_forecast_unit)
print(f"Best 13 Weeks for Unit Sales: {best_start_unit.date()} to {best_end_unit.date()}, Total Sales: {best_sales_unit}")
# Calculate for Dollar Sales
best_start_dollar, best_end_dollar, best_sales_dollar = calculate_best_13_weeks(arima_forecast_dollar)
print(f"Best 13 Weeks for Dollar Sales: {best_start_dollar.date()} to {best_end_dollar.date()}, Total Sales: {best_sales_dollar}")
Best 13 Weeks for Unit Sales: 2023-10-29 to 2024-01-21, Total Sales: 935832.507085375
Best 13 Weeks for Dollar Sales: 2023-10-29 to 2024-01-21, Total Sales: 2148825.242396435
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np
# Assuming forecast_features is your dataframe with a datetime index and UNIT_SALES and DOLLAR_SALES columns
data_unit_sales = forecast_features['UNIT_SALES']
data_dollar_sales = forecast_features['DOLLAR_SALES']
# Number of observations to leave out in each split for testing
n_splits = 5
# The order and seasonal order for ARIMA/SARIMA model
order = (1, 1, 1)
seasonal_order = (1, 1, 1, 52)
# Perform rolling forecast origin for unit sales
def rolling_forecast_origin(time_series, order, seasonal_order, n_splits):
    history = time_series.iloc[:-n_splits].tolist()
    predictions = []
    test_set = time_series.iloc[-n_splits:].tolist()
    for t in range(n_splits):
        model = ARIMA(history, order=order, seasonal_order=seasonal_order)
        model_fit = model.fit()
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        history.append(test_set[t])
    mae = mean_absolute_error(test_set, predictions)
    mse = mean_squared_error(test_set, predictions)
    return predictions, mae, mse
# Perform rolling forecast for UNIT_SALES
predictions_unit, mae_unit, mse_unit = rolling_forecast_origin(data_unit_sales, order, seasonal_order, n_splits)
# Perform rolling forecast for DOLLAR_SALES
predictions_dollar, mae_dollar, mse_dollar = rolling_forecast_origin(data_dollar_sales, order, seasonal_order, n_splits)
# Print the evaluation
print(f'ARIMA model MAE for UNIT_SALES: {mae_unit}')
print(f'ARIMA model MAE for DOLLAR_SALES: {mae_dollar}')
print(f'ARIMA model MSE for UNIT_SALES: {mse_unit}')
print(f'ARIMA model MSE for DOLLAR_SALES: {mse_dollar}')
ARIMA model MAE for UNIT_SALES: 4932.447870350076
ARIMA model MAE for DOLLAR_SALES: 5904.8374217014525
ARIMA model MSE for UNIT_SALES: 32051026.15001712
ARIMA model MSE for DOLLAR_SALES: 60474279.28900906
The MAE values have decreased for both unit sales and dollar sales compared to the other models, so this is a good model.
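The rolling-forecast-origin scheme used above can be made concrete on a toy series (illustrative numbers, no model fitting): at each step the training window grows by one observation and the next point becomes the forecast target.

```python
series = [10, 20, 30, 40, 50, 60, 70, 80]
n_splits = 3

splits = []
for t in range(n_splits):
    train_end = len(series) - n_splits + t
    history = series[:train_end]   # expanding training window
    target = series[train_end]     # next observation to forecast
    splits.append((history, target))

for history, target in splits:
    print(history, '->', target)
```

This expanding-window evaluation mimics production use, where each week's model is refit with all data observed so far.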
From these results, the ARIMA model performs best, with the best 13 weeks for unit sales running from 2023-10-29 to 2024-01-21 (total: 935,832 units) and the best 13 weeks for dollar sales over the same period (total: $2,148,825). All of the models place the best 13 weeks between November and January.
Item Description: Peppy Gentle Drink Pink Woodsy .5L Multi Jug
Caloric Segment: Regular
Type: SSD
Manufacturer: Swire-CC
Brand: Peppy
Package Type: .5L Multi Jug
Flavor: 'Pink Woodsy'
Swire plans to release this product in the Southern region for 13 weeks.
What will the forecasted demand be, in weeks, for this product?
We first filter for brand 'Peppy', manufacturer 'Swire-CC', category 'SSD', and the 'Regular' caloric segment in the Southern-region states: KS, UT, CA, CO, AZ, NM, NV.
Before building the model to forecast sales, let's check the data and find similar products in the dataset. Since the dataset is large, we use Google BigQuery to filter the data based on these requirements and import the filtered data into this notebook.
The dataset provided to us contains no product with the flavor 'Pink Woodsy'. So, we first match on the remaining attributes: Caloric Segment: Regular, Category: SSD, Manufacturer: Swire-CC, and Brand: Peppy.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
job = client.get_job('bquxjob_2dbeaf40_18e972e90ec') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT
fmd.DATE,
SUM(fmd.UNIT_SALES) AS UNIT_SALES,
SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
SELECT DISTINCT zm.MARKET_KEY
FROM `swirecc.zip_to_market_unit_mapping` zm
LEFT JOIN `swirecc.consumer_demographics` cd
ON cd.Zip = zm.ZIP_CODE
WHERE cd.State IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CALORIC_SEGMENT = 'REGULAR'
AND fmd.CATEGORY = 'SSD'
AND fmd.BRAND = 'PEPPY'
AND fmd.MANUFACTURER = 'SWIRE-CC'
GROUP BY
fmd.DATE;
# Running this code will read results from your previous job
job = client.get_job('bquxjob_2dbeaf40_18e972e90ec') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| | DATE | UNIT_SALES | DOLLAR_SALES |
|---|---|---|---|
| 0 | 2023-01-14 | 1406159.0 | 6063254.22 |
| 1 | 2023-06-17 | 1415479.0 | 6217725.66 |
| 2 | 2021-09-18 | 1504042.0 | 5152927.93 |
| 3 | 2021-06-05 | 1593802.0 | 5142980.14 |
| 4 | 2021-11-20 | 1531359.0 | 5326959.38 |
| ... | ... | ... | ... |
| 142 | 2021-04-17 | 1425183.0 | 4636863.16 |
| 143 | 2021-04-10 | 1565628.0 | 4975948.41 |
| 144 | 2021-01-23 | 1457032.0 | 4529955.38 |
| 145 | 2022-01-22 | 1411644.0 | 5062541.25 |
| 146 | 2022-12-31 | 1496171.0 | 6029405.19 |
147 rows × 3 columns
We pull the data from Google BigQuery into this notebook, then modify the 'results' dataframe by converting the 'DATE' column to datetime and deriving year, month, and week features.
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extract relevant features for forecasting (.copy() avoids SettingWithCopyWarning when adding columns below)
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| | DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR |
|---|---|---|---|---|---|---|
| 0 | 2023-01-14 | 1406159.0 | 6063254.22 | 2023 | 1 | 2 |
| 1 | 2023-06-17 | 1415479.0 | 6217725.66 | 2023 | 6 | 24 |
| 2 | 2021-09-18 | 1504042.0 | 5152927.93 | 2021 | 9 | 37 |
| 3 | 2021-06-05 | 1593802.0 | 5142980.14 | 2021 | 6 | 22 |
| 4 | 2021-11-20 | 1531359.0 | 5326959.38 | 2021 | 11 | 46 |
| ... | ... | ... | ... | ... | ... | ... |
| 142 | 2021-04-17 | 1425183.0 | 4636863.16 | 2021 | 4 | 15 |
| 143 | 2021-04-10 | 1565628.0 | 4975948.41 | 2021 | 4 | 14 |
| 144 | 2021-01-23 | 1457032.0 | 4529955.38 | 2021 | 1 | 3 |
| 145 | 2022-01-22 | 1411644.0 | 5062541.25 | 2022 | 1 | 3 |
| 146 | 2022-12-31 | 1496171.0 | 6029405.19 | 2022 | 12 | 52 |
147 rows × 6 columns
We follow the same pattern throughout the notebook: import the dataset from Google BigQuery into this notebook, then extract year, month, and week features from the 'DATE' column.
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
# Converting the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)
# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)
# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)
# Create a weekly future dataframe for one year (weekly frequency to match the weekly sales data) and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=52, freq='W')
future_dollar = prophet_model_dollar.make_future_dataframe(periods=52, freq='W')
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)
# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()
# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date;
    # .copy() avoids pandas' SettingWithCopyWarning when adding the rolling column
    forecast_future = forecast[forecast['ds'] > last_historical_date].copy()
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    best_period_idx = forecast_future['rolling_sum'].idxmax()
    # The rolling sum at a row covers that row and the 12 rows before it,
    # so the idxmax row is the END of the best window
    best_period_end = forecast_future.loc[best_period_idx, 'ds']
    best_period_start = best_period_end - pd.DateOffset(weeks=12)  # 13 weeks include the end week
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)
# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()
# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
From this plot we can see that the best 13 weeks for unit sales run from January to March, and the best 13 weeks for dollar sales run from October to December.
Now let's evaluate the model's performance using MAE (mean absolute error) and MSE (mean squared error).
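As a quick reminder of what these two metrics measure, here is a minimal sketch using made-up numbers (not values from this dataset): MAE averages the absolute errors and stays in the same units as sales, while MSE averages the squared errors and penalizes large misses more heavily.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical actual and predicted weekly sales, for illustration only
actual = [100, 120, 90, 110]
predicted = [110, 115, 95, 100]

# MAE: mean of |actual - predicted|
mae = mean_absolute_error(actual, predicted)
# MSE: mean of (actual - predicted)^2
mse = mean_squared_error(actual, predicted)

print(mae)  # (10 + 5 + 5 + 10) / 4 = 7.5
print(mse)  # (100 + 25 + 25 + 100) / 4 = 62.5
```

Because MSE squares each error, a single badly missed week can dominate it, which is why both metrics are reported side by side below.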
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point].copy() # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy() # Make a copy to avoid modifying the original DataFrame
# Resetting index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)
train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)
# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']]) # Ensure 'ds' and 'y' columns are selected
# Generate forecasts for the test-set period (weekly frequency to match the data)
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test), freq='W')
forecast_unit = prophet_model_unit.predict(future_unit)
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
# Repeat the process for DOLLAR_SALES (rename DOLLAR_SALES to 'y' here;
# reusing the unit-sales 'y' column would refit the model on unit sales)
train_dollar = train[['ds', 'DOLLAR_SALES']].rename(columns={'DOLLAR_SALES': 'y'})
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train_dollar)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test), freq='W')
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
# Printing the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 84142.2759724092, MSE: 11464253863.828047
DOLLAR_SALES - MAE: 4927800.6610338185, MSE: 24365389117751.87
The MAE and MSE for unit sales are 84,142 and 11,464,253,863; for dollar sales, the respective values are 4,927,800 and 24,365,389,117,751.
Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.
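To make the "weighted averages of past observations" idea concrete, here is a toy sketch of simple exponential smoothing (a simplified recursion for illustration, not the Holt-Winters model fitted below): each smoothed value is `alpha * y[t] + (1 - alpha) * smoothed[t-1]`, so older observations receive exponentially decaying weights.

```python
def simple_exponential_smoothing(series, alpha):
    """Toy recursion: smoothed[t] = alpha*y[t] + (1-alpha)*smoothed[t-1]."""
    smoothed = [series[0]]  # initialize the level with the first observation
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

# With alpha=0.5, each step averages the new observation with the running level
print(simple_exponential_smoothing([10, 20, 30, 40], alpha=0.5))
# [10, 15.0, 22.5, 31.25]
```

The statsmodels `ExponentialSmoothing` model used below extends this idea with additive trend and seasonal components (Holt-Winters), which is why `trend='add'`, `seasonal='add'`, and `seasonal_periods=52` are passed for weekly data.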
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Ensuring the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Sorting the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)
# Defining the last date in the DataFrame
last_date = forecast_features.index.max()
# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W-SUN')[1:]
# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index
# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index
# Function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)  # the 13-week window includes the end week
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)
# Plotting function
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    forecast = forecast.clip(lower=0)  # ensure no negative values in the forecast
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()
# Plot the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')
From the plot we can see that the best 13 weeks for unit sales run from November to February, and for dollar sales from August to October.
# Defining the function to find the best 13 weeks
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()
# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)
# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")
exp_forecast.index.freq = 'W-SUN'
exp_forecast_dollar.index.freq = 'W-SUN'
# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]
# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)
print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)
# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()
print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)
print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2023-11-12 and end on 2024-02-04, with total sales: 18799739.97899931
Best 13 weeks for dollar sales start on 2024-08-04 and end on 2024-10-27, with total sales: 88970078.00788526
Best 13 weeks for Unit Sales:
2023-11-12    1.429675e+06
2023-11-19    1.406215e+06
2023-11-26    1.337633e+06
2023-12-03    1.413769e+06
2023-12-10    1.413843e+06
2023-12-17    1.402463e+06
2023-12-24    1.442195e+06
2023-12-31    1.635844e+06
2024-01-07    1.344716e+06
2024-01-14    1.385444e+06
2024-01-21    1.440308e+06
2024-01-28    1.698508e+06
2024-02-04    1.449127e+06
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2024-08-04    6.950594e+06
2024-08-11    7.074054e+06
2024-08-18    6.635315e+06
2024-08-25    6.619353e+06
2024-09-01    6.610059e+06
2024-09-08    6.866880e+06
2024-09-15    6.991518e+06
2024-09-22    6.732203e+06
2024-09-29    6.720961e+06
2024-10-06    7.104676e+06
2024-10-13    7.116964e+06
2024-10-20    6.852696e+06
2024-10-27    6.694806e+06
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['November', 'December', 'January', 'February'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['August', 'September', 'October'], dtype='object')
Over this 13-week window, the total unit sales of these products are 18,799,739 and the total dollar sales are 88,970,078.
Let's evaluate the model's performance metrics.
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]
# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}
# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()
# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)
# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 47282.503481257125, MSE: 3515332106.089926
DOLLAR_SALES - MAE: 213421.68449667966, MSE: 64885923395.17784
The MAE and MSE for unit sales are 47,282 and 3,515,332,106; the respective values for dollar sales are 213,421 and 64,885,923,395.
The MAE values are lower than those of the other models, so this model performs quite well.
For these products in the Southern region, the best 13 weeks for unit sales start on 2023-11-12 and end on 2024-02-04, with total sales of 18,799,739; the best 13 weeks for dollar sales start on 2024-08-04 and end on 2024-10-27, with total revenue of 88,970,078.
Next, we analyze the .5L Multi Jug package type for Swire in the Southern region.
We first filter for the package '.5L Multi Jug' with manufacturer 'Swire-CC', category 'SSD', and the 'Regular' caloric segment in Southern states such as KS, UT, CA, CO, AZ, NM, and NV.
Before building the forecasting model, let's check the data and find similar products in the given dataset. Since the dataset is large, we use Google BigQuery to filter the data based on these requirements and import the filtered data into this notebook.
The dataset provided to us contains no combination with the flavor 'Pink Woodsy', so we first consider the other attributes: package '.5L Multi Jug', caloric segment 'Regular', market category 'SSD', manufacturer 'Swire-CC', and brand 'Peppy'.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
# Running this code will display the query used to generate your previous job
job = client.get_job('bquxjob_4b74ac33_18e97221c14') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT fmd.DATE,SUM(fmd.UNIT_SALES) AS UNIT_SALES, SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
SELECT DISTINCT zm.MARKET_KEY
FROM `swirecc.zip_to_market_unit_mapping` zm
LEFT JOIN `swirecc.consumer_demographics` cd
ON cd.Zip = zm.ZIP_CODE
WHERE cd.State IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CALORIC_SEGMENT = 'REGULAR'
AND fmd.CATEGORY = 'SSD'
AND fmd.PACKAGE LIKE '%.5L MULTI JUG%'
AND fmd.MANUFACTURER = 'SWIRE-CC'
GROUP BY DATE;
# Running this code will read results from your previous job
job = client.get_job('bquxjob_4b74ac33_18e97221c14') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| | DATE | UNIT_SALES | DOLLAR_SALES |
|---|---|---|---|
| 0 | 2021-03-20 | 1.0 | 1.29 |
| 1 | 2021-04-10 | 1.0 | 1.00 |
| 2 | 2022-01-01 | 1.0 | 1.79 |
| 3 | 2021-04-03 | 1.0 | 1.19 |
| 4 | 2021-05-15 | 1.0 | 1.00 |
| 5 | 2021-07-03 | 3.0 | 2.75 |
| 6 | 2022-05-28 | 1.0 | 1.25 |
| 7 | 2021-07-31 | 1.0 | 1.00 |
| 8 | 2021-07-10 | 1.0 | 1.00 |
| 9 | 2023-02-18 | 1.0 | 1.00 |
| 10 | 2021-06-26 | 1.0 | 1.00 |
We pull the data from Google BigQuery into this notebook; next, we convert the 'DATE' column of the 'results' dataframe to datetime and derive year, month, and week features from it.
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extract relevant features for forecasting (copy to avoid a SettingWithCopyWarning below)
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| | DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR |
|---|---|---|---|---|---|---|
| 0 | 2021-03-20 | 1.0 | 1.29 | 2021 | 3 | 11 |
| 1 | 2021-04-10 | 1.0 | 1.00 | 2021 | 4 | 14 |
| 2 | 2022-01-01 | 1.0 | 1.79 | 2022 | 1 | 52 |
| 3 | 2021-04-03 | 1.0 | 1.19 | 2021 | 4 | 13 |
| 4 | 2021-05-15 | 1.0 | 1.00 | 2021 | 5 | 19 |
| 5 | 2021-07-03 | 3.0 | 2.75 | 2021 | 7 | 26 |
| 6 | 2022-05-28 | 1.0 | 1.25 | 2022 | 5 | 21 |
| 7 | 2021-07-31 | 1.0 | 1.00 | 2021 | 7 | 30 |
| 8 | 2021-07-10 | 1.0 | 1.00 | 2021 | 7 | 27 |
| 9 | 2023-02-18 | 1.0 | 1.00 | 2023 | 2 | 7 |
| 10 | 2021-06-26 | 1.0 | 1.00 | 2021 | 6 | 25 |
We follow this same pattern throughout the notebook: import the filtered dataset from Google BigQuery, then extract year, month, and week from the 'DATE' column.
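The date-feature extraction pattern used throughout can be illustrated on a toy frame (two dates taken from the table above). Note that `isocalendar().week` follows the ISO-8601 convention, which is why 2022-01-01 (a Saturday) lands in week 52 of the previous ISO year, as seen in the table.

```python
import pandas as pd

# Toy example of the feature-extraction pattern used throughout the notebook
df = pd.DataFrame({'DATE': ['2021-03-20', '2022-01-01']})
df['DATE'] = pd.to_datetime(df['DATE'])
df['YEAR'] = df['DATE'].dt.year
df['MONTH'] = df['DATE'].dt.month
# ISO weeks run Monday-Sunday, so early-January dates can belong
# to week 52/53 of the previous ISO year
df['WEEK_OF_YEAR'] = df['DATE'].dt.isocalendar().week
print(df)
```

This matches the YEAR/MONTH/WEEK_OF_YEAR columns shown in the forecasting-features tables.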
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Prepare the DataFrame following Prophet's convention ('ds' date column, 'y' target).
# 'DATE' is a regular column here, so select it directly rather than via reset_index()
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)
# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)
# Create a future dataframe for one year and make predictions
future = prophet_model_unit.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future)
forecast_dollar = prophet_model_dollar.predict(future)
# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast):
    # 91 days = 13 weeks on the daily forecast index
    forecast['rolling_sum'] = forecast['yhat'].rolling(window=91, min_periods=1, center=True).sum()
    best_period_idx = forecast['rolling_sum'].idxmax()
    half_window = 91 // 2
    # Clamp to the frame bounds in case the best window sits at either end
    best_period_start = forecast.iloc[max(best_period_idx - half_window, 0)]['ds']
    best_period_end = forecast.iloc[min(best_period_idx + half_window, len(forecast) - 1)]['ds']
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar)
# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()
# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
From this plot we can see that the best 13 weeks for unit sales run from June to August, and for dollar sales from April to June.
Let's evaluate the model's performance metrics.
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point].copy() # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy() # Make a copy to avoid modifying the original DataFrame
# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)
train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)
# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']]) # Ensure 'ds' and 'y' columns are selected
# Generate forecasts for the test-set period (weekly frequency to match the data)
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test), freq='W')
forecast_unit = prophet_model_unit.predict(future_unit)
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
# Repeat the process for DOLLAR_SALES (rename DOLLAR_SALES to 'y' here;
# reusing the unit-sales 'y' column would refit the model on unit sales)
train_dollar = train[['ds', 'DOLLAR_SALES']].rename(columns={'DOLLAR_SALES': 'y'})
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train_dollar)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test), freq='W')
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 29.616903467731532, MSE: 953.6583555096549
DOLLAR_SALES - MAE: 29.963570134398196, MSE: 975.9784331169152
The MAE and MSE for unit sales are 29 and 953; for dollar sales, the respective values are 29 and 975.
From this model we can see that the best 13 weeks for unit sales run from June to August, and for dollar sales from April to June.
Let's now view the .5L Multi Jug for non-Swire-CC manufacturers.
We first filter for the package '.5L Multi Jug' with non-Swire-CC manufacturers, category 'SSD', and the 'Regular' caloric segment in Southern states such as KS, UT, CA, CO, AZ, NM, and NV.
Before building the forecasting model, let's check the data and find similar products in the given dataset. Since the dataset is large, we use Google BigQuery to filter the data based on these requirements and import the filtered data into this notebook.
The dataset provided to us contains no combination with the flavor 'Pink Woodsy', so we first consider the other attributes: package '.5L Multi Jug', caloric segment 'Regular', market category 'SSD', and manufacturer != 'Swire-CC'.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
job = client.get_job('bquxjob_49035d5_18e936b4f5e') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT
fmd.DATE,
SUM(fmd.UNIT_SALES) AS UNIT_SALES,
SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM
`swirecc.fact_market_demand` fmd
JOIN (
SELECT DISTINCT zm.MARKET_KEY
FROM `swirecc.zip_to_market_unit_mapping` zm
JOIN `swirecc.consumer_demographics` cd ON zm.ZIP_CODE = cd.Zip
WHERE cd.State IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE
fmd.PACKAGE LIKE '%.5L MULTI JUG%'
AND fmd.CALORIC_SEGMENT = 'REGULAR'
AND fmd.CATEGORY = 'SSD'
AND fmd.MANUFACTURER != 'SWIRE-CC'
GROUP BY
fmd.DATE;
job = client.get_job('bquxjob_49035d5_18e936b4f5e') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| | DATE | UNIT_SALES | DOLLAR_SALES |
|---|---|---|---|
| 0 | 2023-10-21 | 1.0 | 1.50 |
| 1 | 2022-02-12 | 2.0 | 2.50 |
| 2 | 2023-04-08 | 3.0 | 4.50 |
| 3 | 2022-01-08 | 9.0 | 11.25 |
| 4 | 2021-10-30 | 1.0 | 1.25 |
| 5 | 2022-08-06 | 1.0 | 1.25 |
| 6 | 2021-03-20 | 1.0 | 1.79 |
| 7 | 2021-08-21 | 3.0 | 3.00 |
| 8 | 2022-04-23 | 2.0 | 2.50 |
| 9 | 2023-05-06 | 2.0 | 2.00 |
| 10 | 2022-07-30 | 1.0 | 1.25 |
| 11 | 2023-01-28 | 3.0 | 4.50 |
| 12 | 2021-10-16 | 3.0 | 3.00 |
| 13 | 2021-12-18 | 4.0 | 4.25 |
| 14 | 2023-10-07 | 3.0 | 4.00 |
| 15 | 2023-03-25 | 1.0 | 1.50 |
| 16 | 2022-12-17 | 1.0 | 1.50 |
| 17 | 2022-11-26 | 2.0 | 2.75 |
| 18 | 2023-03-04 | 5.0 | 5.00 |
| 19 | 2023-02-18 | 3.0 | 4.50 |
| 20 | 2022-07-02 | 3.0 | 3.75 |
| 21 | 2022-06-11 | 1.0 | 1.25 |
| 22 | 2023-08-26 | 1.0 | 1.50 |
| 23 | 2023-03-11 | 2.0 | 3.00 |
| 24 | 2023-02-04 | 1.0 | 1.50 |
| 25 | 2022-06-18 | 2.0 | 2.00 |
| 26 | 2023-08-05 | 4.0 | 5.50 |
| 27 | 2021-08-28 | 1.0 | 1.00 |
| 28 | 2023-07-08 | 1.0 | 1.50 |
| 29 | 2023-05-20 | 1.0 | 1.50 |
| 30 | 2022-09-17 | 3.0 | 3.75 |
| 31 | 2022-07-23 | 1.0 | 1.00 |
| 32 | 2022-04-09 | 1.0 | 1.25 |
| 33 | 2022-12-31 | 1.0 | 1.50 |
| 34 | 2023-07-01 | 1.0 | 1.50 |
| 35 | 2022-07-16 | 1.0 | 1.25 |
| 36 | 2022-03-26 | 3.0 | 3.75 |
| 37 | 2023-05-13 | 1.0 | 1.50 |
| 38 | 2022-08-27 | 1.0 | 1.25 |
We pull the data from Google BigQuery into this notebook; next, we convert the 'DATE' column of the 'results' dataframe to datetime and derive year, month, and week features from it.
import pandas as pd
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extract relevant features for forecasting (copy to avoid a SettingWithCopyWarning below)
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| | DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR |
|---|---|---|---|---|---|---|
| 0 | 2023-10-21 | 1.0 | 1.50 | 2023 | 10 | 42 |
| 1 | 2022-02-12 | 2.0 | 2.50 | 2022 | 2 | 6 |
| 2 | 2023-04-08 | 3.0 | 4.50 | 2023 | 4 | 14 |
| 3 | 2022-01-08 | 9.0 | 11.25 | 2022 | 1 | 1 |
| 4 | 2021-10-30 | 1.0 | 1.25 | 2021 | 10 | 43 |
| 5 | 2022-08-06 | 1.0 | 1.25 | 2022 | 8 | 31 |
| 6 | 2021-03-20 | 1.0 | 1.79 | 2021 | 3 | 11 |
| 7 | 2021-08-21 | 3.0 | 3.00 | 2021 | 8 | 33 |
| 8 | 2022-04-23 | 2.0 | 2.50 | 2022 | 4 | 16 |
| 9 | 2023-05-06 | 2.0 | 2.00 | 2023 | 5 | 18 |
| 10 | 2022-07-30 | 1.0 | 1.25 | 2022 | 7 | 30 |
| 11 | 2023-01-28 | 3.0 | 4.50 | 2023 | 1 | 4 |
| 12 | 2021-10-16 | 3.0 | 3.00 | 2021 | 10 | 41 |
| 13 | 2021-12-18 | 4.0 | 4.25 | 2021 | 12 | 50 |
| 14 | 2023-10-07 | 3.0 | 4.00 | 2023 | 10 | 40 |
| 15 | 2023-03-25 | 1.0 | 1.50 | 2023 | 3 | 12 |
| 16 | 2022-12-17 | 1.0 | 1.50 | 2022 | 12 | 50 |
| 17 | 2022-11-26 | 2.0 | 2.75 | 2022 | 11 | 47 |
| 18 | 2023-03-04 | 5.0 | 5.00 | 2023 | 3 | 9 |
| 19 | 2023-02-18 | 3.0 | 4.50 | 2023 | 2 | 7 |
| 20 | 2022-07-02 | 3.0 | 3.75 | 2022 | 7 | 26 |
| 21 | 2022-06-11 | 1.0 | 1.25 | 2022 | 6 | 23 |
| 22 | 2023-08-26 | 1.0 | 1.50 | 2023 | 8 | 34 |
| 23 | 2023-03-11 | 2.0 | 3.00 | 2023 | 3 | 10 |
| 24 | 2023-02-04 | 1.0 | 1.50 | 2023 | 2 | 5 |
| 25 | 2022-06-18 | 2.0 | 2.00 | 2022 | 6 | 24 |
| 26 | 2023-08-05 | 4.0 | 5.50 | 2023 | 8 | 31 |
| 27 | 2021-08-28 | 1.0 | 1.00 | 2021 | 8 | 34 |
| 28 | 2023-07-08 | 1.0 | 1.50 | 2023 | 7 | 27 |
| 29 | 2023-05-20 | 1.0 | 1.50 | 2023 | 5 | 20 |
| 30 | 2022-09-17 | 3.0 | 3.75 | 2022 | 9 | 37 |
| 31 | 2022-07-23 | 1.0 | 1.00 | 2022 | 7 | 29 |
| 32 | 2022-04-09 | 1.0 | 1.25 | 2022 | 4 | 14 |
| 33 | 2022-12-31 | 1.0 | 1.50 | 2022 | 12 | 52 |
| 34 | 2023-07-01 | 1.0 | 1.50 | 2023 | 7 | 26 |
| 35 | 2022-07-16 | 1.0 | 1.25 | 2022 | 7 | 28 |
| 36 | 2022-03-26 | 3.0 | 3.75 | 2022 | 3 | 12 |
| 37 | 2023-05-13 | 1.0 | 1.50 | 2023 | 5 | 19 |
| 38 | 2022-08-27 | 1.0 | 1.25 | 2022 | 8 | 34 |
We follow this same pattern throughout the notebook: import the filtered dataset from Google BigQuery, then extract year, month, and week from the 'DATE' column.
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features = forecast_features.set_index('DATE').sort_index()
# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['UNIT_SALES']].reset_index().rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DOLLAR_SALES']].reset_index().rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)
# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)
# Create a future dataframe for one year and make predictions
future = prophet_model_unit.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future)
forecast_dollar = prophet_model_dollar.predict(future)
# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast):
    # 13 weeks ~ 91 daily forecast rows; centre the window on each row
    rolling_sum = forecast['yhat'].rolling(window=91, min_periods=1, center=True).sum()
    best_pos = rolling_sum.to_numpy().argmax()  # positional index of the window centre
    start_pos = max(best_pos - 91 // 2, 0)      # guard against running off either end
    end_pos = min(best_pos + 91 // 2, len(forecast) - 1)
    return forecast.iloc[start_pos]['ds'], forecast.iloc[end_pos]['ds']
# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar)
# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
fig = model.plot(forecast)
plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
plt.title(title)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
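One modeling detail worth noting: the history here is weekly (one row per Saturday), while `make_future_dataframe` defaults to daily steps. A weekly future grid would match the sampling frequency. This sketch reproduces, in plain pandas, the dates that `make_future_dataframe(periods=52, freq='W-SAT')` would append; the anchor date is the last Saturday in the table above.

```python
import pandas as pd

# 52 weekly steps after the last observed Saturday, mirroring what
# make_future_dataframe(periods=52, freq='W-SAT') would generate.
last_obs = pd.Timestamp('2023-10-07')  # last Saturday in the table above
future_dates = pd.date_range(start=last_obs, periods=53, freq='W-SAT')[1:]
print(future_dates[0].date(), future_dates[-1].date())  # 2023-10-14 2024-10-05
```

With a weekly grid, each forecast row corresponds to an actual sales week rather than an interpolated day.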
From this plot, we can see that the best 13 weeks for both unit sales and dollar sales run from December to March.
Let's evaluate the model's performance metrics.
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point].copy() # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy() # Make a copy to avoid modifying the original DataFrame
# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)
train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)
# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']]) # Ensure 'ds' and 'y' columns are selected
# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'].iloc[-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'].iloc[-len(test):])
# Repeat the process for DOLLAR_SALES (fit on the dollar series, not the unit series)
train_dollar = train[['ds', 'DOLLAR_SALES']].rename(columns={'DOLLAR_SALES': 'y'})
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train_dollar)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'].iloc[-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'].iloc[-len(test):])
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 3.382974095751682, MSE: 12.899271272796936 DOLLAR_SALES - MAE: 4.070474095751682, MSE: 18.703416615016657
The MAE and MSE for unit sales are roughly 3.4 and 12.9; the corresponding values for dollar sales are about 4.1 and 18.7.
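Since MSE is in squared units, it is easier to compare against MAE after taking the square root. A quick sketch using the values printed above:

```python
import math

# RMSE expresses MSE in the original units, making it directly
# comparable to MAE (values taken from the printout above).
mse_unit, mse_dollar = 12.899271272796936, 18.703416615016657
rmse_unit, rmse_dollar = math.sqrt(mse_unit), math.sqrt(mse_dollar)
print(f'UNIT_SALES RMSE: {rmse_unit:.2f}, DOLLAR_SALES RMSE: {rmse_dollar:.2f}')
# → UNIT_SALES RMSE: 3.59, DOLLAR_SALES RMSE: 4.32
```

The RMSE values being close to the MAE values suggests the errors are fairly uniform, without a few extreme misses dominating.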
Since the MAE of the Prophet model is low, the model performs well. From this model, the best 13 weeks for unit sales run from June to August, and for dollar sales from April to June.
Item Description: Greetingle Health Beverage Woodsy Yellow .5L 12One Jug
Caloric Segment: Regular
Market Category: ING Enhanced Water
Manufacturer: Swire-CC
Brand: Greetingle
Package Type: .5L 12One Jug
Flavor: 'Woodsy Yellow'
Swire plans to release this product for 13 weeks, but only in one region.
Which region would it perform best in?
We first filter for package '.5L 12One Jug' and category 'ING Enhanced Water' with manufacturer 'Swire-CC' in the northern regions.
Before building the forecasting model, let's examine the data and find similar products in the given dataset. Because the dataset is large, we use Google BigQuery to filter it to the relevant records and import the filtered data into this notebook.
The dataset contains no product with flavor 'Woodsy Yellow', so we instead match on the other attributes: package '.5L 12ONE JUG', category 'ING ENHANCED WATER', and manufacturer 'SWIRE-CC'.
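The attribute match done in BigQuery can be illustrated locally. The following is illustrative only: a tiny hypothetical stand-in for the fact table, filtered with the same three conditions as the query (column names follow the query; all rows are invented).

```python
import pandas as pd

# Hypothetical stand-in for the fact table; rows are invented.
demand = pd.DataFrame({
    'CATEGORY':     ['ING ENHANCED WATER', 'SSD', 'ING ENHANCED WATER'],
    'MANUFACTURER': ['SWIRE-CC', 'SWIRE-CC', 'OTHER-CO'],
    'PACKAGE':      ['.5L 12ONE JUG', '12OZ CAN', '.5L 12ONE JUG'],
    'UNIT_SALES':   [5.0, 7.0, 3.0],
})
# Same three conditions as the WHERE clause in the BigQuery query
mask = (
    (demand['CATEGORY'] == 'ING ENHANCED WATER')
    & (demand['MANUFACTURER'] == 'SWIRE-CC')
    & (demand['PACKAGE'] == '.5L 12ONE JUG')
)
print(demand[mask]['UNIT_SALES'].sum())  # → 5.0
```

Only the row matching all three attributes survives, which is exactly how the substitute product's history is isolated.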
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
job = client.get_job('bquxjob_28d5a02f_18e978a1a5b') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT
fmd.DATE,
SUM(fmd.UNIT_SALES) AS UNIT_SALES,
SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
SELECT DISTINCT zm.MARKET_KEY
FROM `swirecc.zip_to_market_unit_mapping` zm
LEFT JOIN `swirecc.consumer_demographics` cd
ON cd.Zip = zm.ZIP_CODE
WHERE cd.State NOT IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CATEGORY = 'ING ENHANCED WATER'
AND fmd.MANUFACTURER = 'SWIRE-CC'
AND fmd.PACKAGE = '.5L 12ONE JUG'
GROUP BY
fmd.DATE;
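The `GROUP BY fmd.DATE` with `SUM(...)` in the query maps directly onto a pandas groupby. A sketch on hypothetical rows standing in for the joined, filtered fact table (the split of the 2021-02-06 total into two rows is invented; the totals match the output below):

```python
import pandas as pd

# Hypothetical pre-aggregation rows; the groupby reproduces the
# SUM(...) GROUP BY DATE from the BigQuery query.
fmd = pd.DataFrame({
    'DATE': ['2021-02-06', '2021-02-06', '2021-07-31'],
    'UNIT_SALES': [3.0, 2.0, 5.0],
    'DOLLAR_SALES': [7.50, 5.06, 14.25],
})
agg = fmd.groupby('DATE', as_index=False)[['UNIT_SALES', 'DOLLAR_SALES']].sum()
print(agg)
```

Doing the aggregation in BigQuery instead keeps the data transferred to the notebook small.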
# Running this code will read results from your previous job
job = client.get_job('bquxjob_28d5a02f_18e978a1a5b') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| DATE | UNIT_SALES | DOLLAR_SALES | |
|---|---|---|---|
| 0 | 2021-02-06 | 5.0 | 12.56 |
| 1 | 2021-07-31 | 5.0 | 14.25 |
| 2 | 2021-08-07 | 2.0 | 5.68 |
| 3 | 2021-05-15 | 6.0 | 34.67 |
| 4 | 2021-07-03 | 6.0 | 14.26 |
| 5 | 2021-12-04 | 1.0 | 2.49 |
| 6 | 2021-10-09 | 3.0 | 28.86 |
| 7 | 2021-08-14 | 5.0 | 12.45 |
| 8 | 2021-07-10 | 4.0 | 9.96 |
| 9 | 2021-08-28 | 1.0 | 2.49 |
| 10 | 2021-10-02 | 1.0 | 2.49 |
| 11 | 2021-01-30 | 3.0 | 16.98 |
| 12 | 2021-05-01 | 5.0 | 11.47 |
| 13 | 2022-02-12 | 13.0 | 32.37 |
| 14 | 2022-02-26 | 1.0 | 2.49 |
| 15 | 2021-06-26 | 2.0 | 4.98 |
| 16 | 2021-12-25 | 10.0 | 25.20 |
| 17 | 2021-09-04 | 1.0 | 2.49 |
| 18 | 2021-05-08 | 2.0 | 4.00 |
| 19 | 2021-05-29 | 4.0 | 8.00 |
| 20 | 2021-02-13 | 11.0 | 25.25 |
| 21 | 2021-03-20 | 7.0 | 16.45 |
| 22 | 2023-06-10 | 1.0 | 23.88 |
| 23 | 2021-06-12 | 4.0 | 9.96 |
| 24 | 2021-03-27 | 2.0 | 4.00 |
| 25 | 2021-09-11 | 1.0 | 2.49 |
| 26 | 2022-01-08 | 4.0 | 11.16 |
| 27 | 2021-10-30 | 1.0 | 2.49 |
| 28 | 2021-03-13 | 2.0 | 4.00 |
| 29 | 2021-07-17 | 2.0 | 4.98 |
| 30 | 2021-04-24 | 5.0 | 10.49 |
| 31 | 2021-04-03 | 1.0 | 2.49 |
| 32 | 2021-01-16 | 1.0 | 20.00 |
| 33 | 2022-03-12 | 1.0 | 2.49 |
| 34 | 2021-06-05 | 6.0 | 12.98 |
| 35 | 2021-09-18 | 2.0 | 2.91 |
| 36 | 2021-02-27 | 3.0 | 7.47 |
| 37 | 2021-09-25 | 16.0 | 38.17 |
| 38 | 2021-03-06 | 17.0 | 34.39 |
| 39 | 2021-02-20 | 14.0 | 50.37 |
| 40 | 2021-04-10 | 4.0 | 8.98 |
| 41 | 2021-05-22 | 18.0 | 36.06 |
| 42 | 2023-07-08 | 1.0 | 23.88 |
| 43 | 2021-04-17 | 4.0 | 8.00 |
| 44 | 2021-01-23 | 1.0 | 24.00 |
| 45 | 2021-06-19 | 8.0 | 16.00 |
| 46 | 2022-06-25 | 15.0 | 40.35 |
| 47 | 2021-10-23 | 2.0 | 4.98 |
We pull the data from Google BigQuery into this notebook, then modify the 'results' DataFrame by converting the 'DATE' column to datetime and deriving year, month, and week columns from it.
import pandas as pd
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extract relevant features for forecasting (copy to avoid SettingWithCopyWarning)
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR | |
|---|---|---|---|---|---|---|
| 0 | 2021-05-22 | 18.0 | 36.06 | 2021 | 5 | 20 |
| 1 | 2021-04-10 | 4.0 | 8.98 | 2021 | 4 | 14 |
| 2 | 2021-04-17 | 4.0 | 8.00 | 2021 | 4 | 15 |
| 3 | 2021-01-23 | 1.0 | 24.00 | 2021 | 1 | 3 |
| 4 | 2021-06-19 | 8.0 | 16.00 | 2021 | 6 | 24 |
| 5 | 2023-07-08 | 1.0 | 23.88 | 2023 | 7 | 27 |
| 6 | 2021-12-04 | 1.0 | 2.49 | 2021 | 12 | 48 |
| 7 | 2021-05-15 | 6.0 | 34.67 | 2021 | 5 | 19 |
| 8 | 2021-07-03 | 6.0 | 14.26 | 2021 | 7 | 26 |
| 9 | 2021-10-09 | 3.0 | 28.86 | 2021 | 10 | 40 |
| 10 | 2021-03-13 | 2.0 | 4.00 | 2021 | 3 | 10 |
| 11 | 2021-04-03 | 1.0 | 2.49 | 2021 | 4 | 13 |
| 12 | 2021-04-24 | 5.0 | 10.49 | 2021 | 4 | 16 |
| 13 | 2021-07-17 | 2.0 | 4.98 | 2021 | 7 | 28 |
| 14 | 2021-08-14 | 5.0 | 12.45 | 2021 | 8 | 32 |
| 15 | 2021-10-02 | 1.0 | 2.49 | 2021 | 10 | 39 |
| 16 | 2021-07-10 | 4.0 | 9.96 | 2021 | 7 | 27 |
| 17 | 2021-08-28 | 1.0 | 2.49 | 2021 | 8 | 34 |
| 18 | 2021-02-06 | 5.0 | 12.56 | 2021 | 2 | 5 |
| 19 | 2021-08-07 | 2.0 | 5.68 | 2021 | 8 | 31 |
| 20 | 2021-07-31 | 5.0 | 14.25 | 2021 | 7 | 30 |
| 21 | 2021-09-04 | 1.0 | 2.49 | 2021 | 9 | 35 |
| 22 | 2021-12-25 | 10.0 | 25.20 | 2021 | 12 | 51 |
| 23 | 2021-05-01 | 5.0 | 11.47 | 2021 | 5 | 17 |
| 24 | 2021-01-30 | 3.0 | 16.98 | 2021 | 1 | 4 |
| 25 | 2022-02-12 | 13.0 | 32.37 | 2022 | 2 | 6 |
| 26 | 2022-02-26 | 1.0 | 2.49 | 2022 | 2 | 8 |
| 27 | 2021-06-26 | 2.0 | 4.98 | 2021 | 6 | 25 |
| 28 | 2021-05-08 | 2.0 | 4.00 | 2021 | 5 | 18 |
| 29 | 2021-09-18 | 2.0 | 2.91 | 2021 | 9 | 37 |
| 30 | 2021-02-27 | 3.0 | 7.47 | 2021 | 2 | 8 |
| 31 | 2021-06-05 | 6.0 | 12.98 | 2021 | 6 | 22 |
| 32 | 2021-01-16 | 1.0 | 20.00 | 2021 | 1 | 2 |
| 33 | 2022-03-12 | 1.0 | 2.49 | 2022 | 3 | 10 |
| 34 | 2021-02-13 | 11.0 | 25.25 | 2021 | 2 | 6 |
| 35 | 2021-05-29 | 4.0 | 8.00 | 2021 | 5 | 21 |
| 36 | 2021-06-12 | 4.0 | 9.96 | 2021 | 6 | 23 |
| 37 | 2021-03-27 | 2.0 | 4.00 | 2021 | 3 | 12 |
| 38 | 2023-06-10 | 1.0 | 23.88 | 2023 | 6 | 23 |
| 39 | 2021-03-20 | 7.0 | 16.45 | 2021 | 3 | 11 |
| 40 | 2021-10-30 | 1.0 | 2.49 | 2021 | 10 | 43 |
| 41 | 2022-01-08 | 4.0 | 11.16 | 2022 | 1 | 1 |
| 42 | 2021-09-11 | 1.0 | 2.49 | 2021 | 9 | 36 |
| 43 | 2022-06-25 | 15.0 | 40.35 | 2022 | 6 | 25 |
| 44 | 2021-10-23 | 2.0 | 4.98 | 2021 | 10 | 42 |
| 45 | 2021-09-25 | 16.0 | 38.17 | 2021 | 9 | 38 |
| 46 | 2021-02-20 | 14.0 | 50.37 | 2021 | 2 | 7 |
| 47 | 2021-03-06 | 17.0 | 34.39 | 2021 | 3 | 9 |
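One caveat about the WEEK_OF_YEAR column: `isocalendar().week` returns ISO week numbers, which can disagree with the calendar year near year boundaries (for instance, 2021-01-16 above is week 2 because ISO week 1 of 2021 starts on Monday, January 4). A minimal illustration:

```python
import pandas as pd

# ISO week numbers can disagree with the calendar year at year boundaries:
d = pd.Timestamp('2021-01-02')
print(d.isocalendar())  # ISO year 2020, week 53 — not week 1 of 2021
```

This matters if weekly features are later grouped by (YEAR, WEEK_OF_YEAR), since early-January dates can land in week 52/53 of the previous ISO year.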
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
# Converting the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)
# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)
# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)
# Create a future dataframe for one year and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=365)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)
# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()
# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted dates after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date].copy()
    # 13 weeks ~ 91 days of daily forecast rows; the rolling sum is
    # right-aligned, so idxmax marks the END of the best window
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=91, min_periods=1).sum()
    best_period_end = forecast_future.loc[forecast_future['rolling_sum'].idxmax(), 'ds']
    best_period_start = best_period_end - pd.DateOffset(days=90)  # inclusive 91-day span
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)
# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
fig = model.plot(forecast)
plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
plt.title(title)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
From this plot, we can see that the best 13 weeks for unit sales run from January to March, and for dollar sales from May to August.
Let's evaluate the model's performance metrics.
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point].copy() # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy() # Make a copy to avoid modifying the original DataFrame
# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)
train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)
# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']]) # Ensure 'ds' and 'y' columns are selected
# Generate forecasts for the test set period
future_unit = prophet_model_unit.make_future_dataframe(periods=len(test))
forecast_unit = prophet_model_unit.predict(future_unit)
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'].iloc[-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'].iloc[-len(test):])
# Repeat the process for DOLLAR_SALES (fit on the dollar series, not the unit series)
train_dollar = train[['ds', 'DOLLAR_SALES']].rename(columns={'DOLLAR_SALES': 'y'})
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train_dollar)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=len(test))
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'].iloc[-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'].iloc[-len(test):])
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 84142.2759724092, MSE: 11464253863.828045 DOLLAR_SALES - MAE: 4927800.661033819, MSE: 24365389117751.88
The MAE and MSE for unit sales are roughly 84,142 and 11.5 billion; the corresponding values for dollar sales are about 4.9 million and 24.4 trillion.
These error values are extreme because we restricted the data to the northern region, where sales are very low and sparse, so the model cannot pick up a consistent pattern. The forecast sales themselves are correspondingly low in the northern region: around 5 units and roughly 30 dollars per week.
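On such sparse, low-volume series, Prophet's additive model can also produce negative `yhat` values, which are meaningless for sales. A common post-processing step (a sketch, on hypothetical values) is clipping forecasts at zero:

```python
import pandas as pd

# Prophet's additive model can dip below zero on sparse, low-volume
# series; clip the forecast at zero before reporting it.
forecast = pd.DataFrame({'yhat': [4.2, -1.3, 0.5]})  # hypothetical values
forecast['yhat_clipped'] = forecast['yhat'].clip(lower=0)
print(forecast['yhat_clipped'].tolist())  # → [4.2, 0.0, 0.5]
```

Clipping does not fix the underlying fit, but it keeps reported demand forecasts physically sensible.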
We next filter for package '.5L 12One Jug' and category 'ING Enhanced Water' with manufacturer 'Swire-CC' in the southern regions.
As before, we use Google BigQuery to filter the large dataset and import the result into this notebook. Since no product with flavor 'Woodsy Yellow' exists in the data, we again match on package '.5L 12ONE JUG', category 'ING ENHANCED WATER', and manufacturer 'SWIRE-CC', this time in the southern region.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
# This code will display the query used to generate your previous job.
job = client.get_job('bquxjob_53d98272_18e9377e43f') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT
fmd.DATE,
SUM(fmd.UNIT_SALES) AS UNIT_SALES,
SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
SELECT DISTINCT zm.MARKET_KEY
FROM `swirecc.zip_to_market_unit_mapping` zm
LEFT JOIN `swirecc.consumer_demographics` cd
ON cd.Zip = zm.ZIP_CODE
WHERE cd.State IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CATEGORY = 'ING ENHANCED WATER'
AND fmd.MANUFACTURER = 'SWIRE-CC'
AND fmd.PACKAGE = '.5L 12ONE JUG'
GROUP BY
fmd.DATE;
# This code will read results from your previous job
job = client.get_job('bquxjob_53d98272_18e9377e43f') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| DATE | UNIT_SALES | DOLLAR_SALES | |
|---|---|---|---|
| 0 | 2021-02-13 | 13.0 | 29.93 |
| 1 | 2021-03-27 | 13.0 | 30.41 |
| 2 | 2021-09-11 | 7.0 | 17.73 |
| 3 | 2021-10-30 | 2.0 | 26.37 |
| 4 | 2021-06-12 | 12.0 | 29.98 |
| ... | ... | ... | ... |
| 113 | 2023-04-29 | 1.0 | 2.89 |
| 114 | 2022-12-03 | 3.0 | 8.77 |
| 115 | 2023-06-03 | 1.0 | 2.89 |
| 116 | 2023-02-11 | 3.0 | 8.67 |
| 117 | 2022-06-11 | 4.0 | 10.76 |
118 rows × 3 columns
We pull the data from Google BigQuery into this notebook, then modify the 'results' DataFrame by converting the 'DATE' column to datetime and deriving year, month, and week columns from it.
import pandas as pd
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extract relevant features for forecasting (copy to avoid SettingWithCopyWarning)
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR | |
|---|---|---|---|---|---|---|
| 0 | 2021-02-13 | 13.0 | 29.93 | 2021 | 2 | 6 |
| 1 | 2021-03-27 | 13.0 | 30.41 | 2021 | 3 | 12 |
| 2 | 2021-09-11 | 7.0 | 17.73 | 2021 | 9 | 36 |
| 3 | 2021-10-30 | 2.0 | 26.37 | 2021 | 10 | 43 |
| 4 | 2021-06-12 | 12.0 | 29.98 | 2021 | 6 | 23 |
| ... | ... | ... | ... | ... | ... | ... |
| 113 | 2023-04-29 | 1.0 | 2.89 | 2023 | 4 | 17 |
| 114 | 2022-12-03 | 3.0 | 8.77 | 2022 | 12 | 48 |
| 115 | 2023-06-03 | 1.0 | 2.89 | 2023 | 6 | 22 |
| 116 | 2023-02-11 | 3.0 | 8.67 | 2023 | 2 | 6 |
| 117 | 2022-06-11 | 4.0 | 10.76 | 2022 | 6 | 23 |
118 rows × 6 columns
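Note that only 118 weeks appear in this result: weeks with no sales are simply absent from the query output. If absence really means zero sales (rather than missing data), reindexing to a complete weekly grid filled with 0 may be more faithful before modeling. A sketch on hypothetical weekly values:

```python
import pandas as pd

# Two observed Saturdays with a gap between them; resampling to a complete
# weekly grid fills the missing week with 0 (empty bins sum to 0.0).
s = pd.Series([13.0, 7.0], index=pd.to_datetime(['2021-02-13', '2021-02-27']))
weekly = s.resample('W-SAT').sum()  # inserts 2021-02-20 with 0.0
print(weekly.tolist())  # → [13.0, 0.0, 7.0]
```

Whether to zero-fill depends on the data's semantics; Prophet itself tolerates missing rows, treating them as unobserved rather than zero.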
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
# Convert the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)
# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)
# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)
# Create a future dataframe for one year and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=365)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)
# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()
# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted dates after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date].copy()
    # 13 weeks ~ 91 days of daily forecast rows; the rolling sum is
    # right-aligned, so idxmax marks the END of the best window
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=91, min_periods=1).sum()
    best_period_end = forecast_future.loc[forecast_future['rolling_sum'].idxmax(), 'ds']
    best_period_start = best_period_end - pd.DateOffset(days=90)  # inclusive 91-day span
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)
# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
fig = model.plot(forecast)
plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
plt.title(title)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
From this plot, we can see that the best 13 weeks for unit sales run from October to January, while for dollar sales they run from May to July.
Let's evaluate the model's performance metrics.
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point].copy() # Make a copy to avoid modifying the original DataFrame
test = forecast_features.iloc[split_point:].copy() # Make a copy to avoid modifying the original DataFrame
# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)
# Rename columns to Prophet's expected 'ds'/'y' convention for the UNIT_SALES model
train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)
# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']]) # Ensure 'ds' and 'y' columns are selected
# Generate forecasts on the actual test dates so predictions align with the test set
# (make_future_dataframe defaults to daily steps, which would not match the weekly test dates)
future_unit = test[['DATE']].rename(columns={'DATE': 'ds'})
forecast_unit = prophet_model_unit.predict(future_unit)
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'])
# Repeat the process for DOLLAR_SALES; DOLLAR_SALES needs its own 'y' column,
# otherwise the dollar model would be fit on unit sales
train_dollar = train[['ds', 'DOLLAR_SALES']].rename(columns={'DOLLAR_SALES': 'y'})
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train_dollar)
future_dollar = test[['DATE']].rename(columns={'DATE': 'ds'})
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'])
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 14.760235496932111, MSE: 243.83874523289515
DOLLAR_SALES - MAE: 19.94565216359878, MSE: 471.88964169847605
The MAE and MSE values for unit sales are roughly 14 and 243; for dollar sales they are roughly 19 and 471.
Compared to the Northern region, the Southern region has higher sales, and its MAE and MSE values are lower, which suggests the model fits well. However, sales in the Southern region are also decreasing over the years.
Since data points for the right combination of package type are not available, we turn to non-Swire products for further analysis.
We now filter for package '.5L 12One Jug' and category 'Ing Enhanced Water' with a non-Swire-CC manufacturer in the Northern region.
Before building the model to forecast sales, let's check the data and find similar products in the given dataset. Since the dataset is large, we use Google BigQuery to filter the data based on the requirements and import the filtered data into this notebook.
The dataset provided to us contains no combinations with the flavor 'Woodsy Yellow', so we first consider the remaining attributes: package '.5L 12One Jug', market category 'Ing Enhanced Water', and a manufacturer other than Swire-CC in the Northern region.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
# This code will display the query used to generate your previous job.
job = client.get_job('bquxjob_70b31c_18e937d73eb') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT
fmd.DATE,
SUM(fmd.UNIT_SALES) AS UNIT_SALES,
SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
SELECT DISTINCT zm.MARKET_KEY
FROM `swirecc.zip_to_market_unit_mapping` zm
LEFT JOIN `swirecc.consumer_demographics` cd
ON cd.Zip = zm.ZIP_CODE
WHERE cd.State NOT IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CATEGORY = 'ING ENHANCED WATER'
AND fmd.MANUFACTURER != 'SWIRE-CC'
AND fmd.PACKAGE = '.5L 12ONE JUG'
GROUP BY
fmd.DATE;
# This code will read results from your previous job
job = client.get_job('bquxjob_70b31c_18e937d73eb') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
|   | DATE | UNIT_SALES | DOLLAR_SALES |
|---|---|---|---|
| 0 | 2022-04-23 | 11404.0 | 87848.49 |
| 1 | 2023-06-10 | 12910.0 | 116092.66 |
| 2 | 2021-05-29 | 9963.0 | 66150.83 |
| 3 | 2021-10-30 | 7882.0 | 53111.21 |
| 4 | 2023-04-08 | 11684.0 | 104108.96 |
| ... | ... | ... | ... |
| 143 | 2021-07-17 | 12477.0 | 83090.46 |
| 144 | 2021-04-24 | 8444.0 | 56115.15 |
| 145 | 2021-04-03 | 7517.0 | 49721.32 |
| 146 | 2022-01-01 | 6378.0 | 49240.57 |
| 147 | 2021-11-06 | 7643.0 | 51266.10 |
148 rows × 3 columns
We pull the data from Google BigQuery into this notebook; next we transform the 'results' DataFrame by converting the 'DATE' column to datetime and deriving year, month, and week features from it.
import pandas as pd
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()  # copy so feature columns can be added without a SettingWithCopyWarning
# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
|   | DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR |
|---|---|---|---|---|---|---|
| 0 | 2022-04-23 | 11404.0 | 87848.49 | 2022 | 4 | 16 |
| 1 | 2023-06-10 | 12910.0 | 116092.66 | 2023 | 6 | 23 |
| 2 | 2021-05-29 | 9963.0 | 66150.83 | 2021 | 5 | 21 |
| 3 | 2021-10-30 | 7882.0 | 53111.21 | 2021 | 10 | 43 |
| 4 | 2023-04-08 | 11684.0 | 104108.96 | 2023 | 4 | 14 |
| ... | ... | ... | ... | ... | ... | ... |
| 143 | 2021-07-17 | 12477.0 | 83090.46 | 2021 | 7 | 28 |
| 144 | 2021-04-24 | 8444.0 | 56115.15 | 2021 | 4 | 16 |
| 145 | 2021-04-03 | 7517.0 | 49721.32 | 2021 | 4 | 13 |
| 146 | 2022-01-01 | 6378.0 | 49240.57 | 2022 | 1 | 52 |
| 147 | 2021-11-06 | 7643.0 | 51266.10 | 2021 | 11 | 44 |
148 rows × 6 columns
We follow this same pattern throughout the notebook: import the filtered dataset from Google BigQuery and extract year, month, and week features from the 'DATE' column.
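Since this import-and-feature-engineering step is repeated in every section, it can be collected into a small helper. The sketch below is illustrative, not code from the notebook; the function name `prepare_forecast_features` and the fixed column list are assumptions:

```python
import pandas as pd

def prepare_forecast_features(results: pd.DataFrame) -> pd.DataFrame:
    """Convert DATE to datetime and add YEAR/MONTH/WEEK_OF_YEAR features."""
    out = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()  # copy to leave the source untouched
    out['DATE'] = pd.to_datetime(out['DATE'])
    out['YEAR'] = out['DATE'].dt.year
    out['MONTH'] = out['DATE'].dt.month
    out['WEEK_OF_YEAR'] = out['DATE'].dt.isocalendar().week  # ISO week number
    return out
```

Each section could then call this helper on its own BigQuery result instead of repeating the five transformation lines.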
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
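Conceptually, Prophet decomposes the series additively as y(t) = g(t) + s(t) + h(t) + error (trend, seasonality, holidays, noise). Here is a toy sketch of such an additive decomposition, using made-up numbers unrelated to the Swire data:

```python
import numpy as np

t = np.arange(104)                           # two years of weekly steps
trend = 100 + 0.5 * t                        # g(t): linear growth
seasonal = 10 * np.sin(2 * np.pi * t / 52)   # s(t): yearly cycle over 52 weeks
y = trend + seasonal                         # additive combination (holidays/noise omitted)
```

Prophet fits each component separately from the data, which is why it copes with missing weeks and trend shifts better than methods that model the raw series directly.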
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
# Converting the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)
# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)
# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)
# Create a future dataframe for one year and make predictions
future_unit = prophet_model_unit.make_future_dataframe(periods=365)
future_dollar = prophet_model_dollar.make_future_dataframe(periods=365)
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)
# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()
# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date].copy()  # copy to avoid SettingWithCopyWarning
    # Sum yhat over a trailing 13-week (91-day) window; idxmax marks the END of the best
    # window, not its start, so the start is recovered by stepping back 90 days
    rolling_sum = forecast_future.set_index('ds')['yhat'].rolling(window='91D').sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.Timedelta(days=90)  # 91 days = 13 weeks inclusive
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)
# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
    fig = model.plot(forecast)
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()
# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
From this plot, we can see that the best 13 weeks for both unit sales and dollar sales run from July to October.
Let's evaluate the model's performance metrics.
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point].copy()
test = forecast_features.iloc[split_point:].copy()
# Reset index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)
train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)
# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']]) # Ensure 'ds' and 'y' columns are selected
# Generate forecasts on the actual test dates so predictions align with the test set
# (make_future_dataframe defaults to daily steps, which would not match the weekly test dates)
future_unit = test[['DATE']].rename(columns={'DATE': 'ds'})
forecast_unit = prophet_model_unit.predict(future_unit)
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'])
# Repeat the process for DOLLAR_SALES; DOLLAR_SALES needs its own 'y' column,
# otherwise the dollar model would be fit on unit sales
train_dollar = train[['ds', 'DOLLAR_SALES']].rename(columns={'DOLLAR_SALES': 'y'})
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train_dollar)
future_dollar = test[['DATE']].rename(columns={'DATE': 'ds'})
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'])
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 84142.2759724092, MSE: 11464253863.828045
DOLLAR_SALES - MAE: 4927800.661033819, MSE: 24365389117751.88
The MAE and MSE values for unit sales are roughly 84,142 and 11,464,253,863; for dollar sales they are roughly 4,927,800 and 24,365,389,117,751. Errors this large indicate that Prophet fits this series poorly, so we try exponential smoothing next.
Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.
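The idea is easiest to see in the recursion for the simple (non-seasonal) case, where the smoothed level is a weighted average of the newest observation and the previous level. This toy illustration uses an arbitrary smoothing weight and made-up numbers; the Holt-Winters model fit below adds trend and seasonal components on top of this recursion:

```python
alpha = 0.5                      # smoothing weight: higher puts more weight on recent data
series = [10.0, 12.0, 11.0, 13.0]
level = series[0]                # initialize with the first observation
for y in series[1:]:
    # each new level is a weighted average of the new point and the old level
    level = alpha * y + (1 - alpha) * level
print(level)  # smoothed level after the last observation
```

Because each update folds the previous level back in, older observations decay geometrically in influence, which is what makes the method track gradual trends well.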
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Sort the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)
# Define the last date in the DataFrame
last_date = forecast_features.index.max()
# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W')[1:]
# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
    forecast_features['UNIT_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index
# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
    forecast_features['DOLLAR_SALES'],
    trend='add',
    seasonal='add',
    seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index
# Function to find the best 13 weeks within the forecast
def find_best_13_weeks(forecast):
    # Trailing 13-week (91-day) window; idxmax marks the window's END
    rolling_sum = forecast.rolling(window='91D').sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)  # 13 weekly points inclusive
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)
# Plotting function
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
    plt.figure(figsize=(14, 7))
    plt.plot(forecast.index, forecast, label='Forecast')
    plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
    plt.title(title)
    plt.xlabel('Date')
    plt.ylabel('Sales')
    plt.legend()
    plt.show()
# Plot the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')
From the plot, we can see that the best 13 weeks for both unit sales and dollar sales run from July to October.
# Define the function to find the best 13 weeks
def find_best_13_weeks(forecast):
    # Require full 13-week windows so partial sums at the start are not selected
    rolling_sum = forecast.rolling(window=13, min_periods=13).sum()
    best_period_end = rolling_sum.idxmax()
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end, rolling_sum.max()
# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)
# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")
exp_forecast.index.freq = 'W-SUN'
exp_forecast_dollar.index.freq = 'W-SUN'
# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]
# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)
print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)
# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()
print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)
print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2024-07-14 and end on 2024-10-06, with total sales: 190316.1355028812
Best 13 weeks for dollar sales start on 2024-07-14 and end on 2024-10-06, with total sales: 1682082.4189948225

Best 13 weeks for Unit Sales:
2024-07-14    14310.653500
2024-07-21    15523.281384
2024-07-28    17459.010526
2024-08-04    16674.040234
2024-08-11    15631.792997
2024-08-18    15306.547874
2024-08-25    15678.845612
2024-09-01    13125.795352
2024-09-08    13452.970459
2024-09-15    12497.527782
2024-09-22    13325.562461
2024-09-29    12780.805251
2024-10-06    14549.302071
Freq: W-SUN, dtype: float64

Best 13 weeks for Dollar Sales:
2024-07-14    127282.602900
2024-07-21    135063.109120
2024-07-28    149215.453788
2024-08-04    144120.410291
2024-08-11    137306.625192
2024-08-18    133775.094500
2024-08-25    135460.550054
2024-09-01    116160.568178
2024-09-08    121629.692055
2024-09-15    114710.283040
2024-09-22    120714.608959
2024-09-29    116253.996762
2024-10-06    130389.424156
Freq: W-SUN, dtype: float64

Best months for Unit Sales within the 13-week period:
Index(['July', 'August', 'September', 'October'], dtype='object')

Best months for Dollar Sales within the 13-week period:
Index(['July', 'August', 'September', 'October'], dtype='object')
The total unit sales for these products over these 13 weeks are about 190,316, and the dollar sales are about 1,682,082.
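To make the 13-week total concrete, the rolling-window logic can be checked on a toy weekly series (synthetic values, not the actual forecast):

```python
import pandas as pd

# Toy weekly series: 15 weeks of increasing "sales"
idx = pd.date_range('2024-07-14', periods=15, freq='W-SUN')
s = pd.Series(range(1, 16), index=idx, dtype=float)

totals = s.rolling(window=13).sum()              # 13-week running totals (NaN until a full window)
best_end = totals.idxmax()                       # week at which the best window ENDS
best_start = best_end - pd.DateOffset(weeks=12)  # 13 weeks inclusive
```

Note that `rolling(window=13).sum()` attaches each total to the last week of its window, which is why the start date has to be recovered by stepping back twelve weeks from the argmax.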
# Splitting the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]
# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}
# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()
# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)
# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 4711.606597501726, MSE: 26676303.715555746
DOLLAR_SALES - MAE: 32546.860312665805, MSE: 1267153502.8132436
The MAE and MSE values for unit sales are roughly 4,711 and 26,676,303; for dollar sales they are roughly 32,546 and 1,267,153,502.
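Because MSE is expressed in squared units, it is hard to read directly; taking the square root (RMSE) puts the error back on the original sales scale. A quick check using the MSE values printed above:

```python
import math

mse_unit = 26676303.715555746        # UNIT_SALES MSE from the run above
mse_dollar = 1267153502.8132436      # DOLLAR_SALES MSE from the run above
rmse_unit = math.sqrt(mse_unit)      # typical error in units sold per week
rmse_dollar = math.sqrt(mse_dollar)  # typical error in dollars per week
print(round(rmse_unit), round(rmse_dollar))
```

So the typical weekly error is on the order of a few thousand units and a few tens of thousands of dollars, small relative to the roughly 190,000-unit 13-week totals forecast above.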
The MAE values are much lower than those of the other models, so this is a considerably better model.
According to the exponential smoothing model, for non-Swire-CC manufacturers selling products with package '.5L 12One Jug' and category 'Ing Enhanced Water' in the Northern region, the best 13 weeks run from July to October, with unit sales of about 190,316 and dollar sales of about 1,682,082.
We now filter for package '.5L 12One Jug' and category 'Ing Enhanced Water' with a non-Swire-CC manufacturer in the Southern region.
Before building the model to forecast sales, let's check the data and find similar products in the given dataset. Since the dataset is large, we use Google BigQuery to filter the data based on the requirements and import the filtered data into this notebook.
The dataset provided to us contains no combinations with the flavor 'Woodsy Yellow', so we first consider the remaining attributes: package '.5L 12One Jug', market category 'Ing Enhanced Water', and a manufacturer other than Swire-CC in the Southern region.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
# This code will display the query used to generate your previous job.
job = client.get_job('bquxjob_3930fb1e_18e9383de9e') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT
fmd.DATE,
SUM(fmd.UNIT_SALES) AS UNIT_SALES,
SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
SELECT DISTINCT zm.MARKET_KEY
FROM `swirecc.zip_to_market_unit_mapping` zm
LEFT JOIN `swirecc.consumer_demographics` cd
ON cd.Zip = zm.ZIP_CODE
WHERE cd.State IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CATEGORY = 'ING ENHANCED WATER'
AND fmd.MANUFACTURER != 'SWIRE-CC'
AND fmd.PACKAGE = '.5L 12ONE JUG'
GROUP BY
fmd.DATE;
# This code will read results from your previous job
job = client.get_job('bquxjob_3930fb1e_18e9383de9e') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
|   | DATE | UNIT_SALES | DOLLAR_SALES |
|---|---|---|---|
| 0 | 2022-11-12 | 33680.0 | 269639.90 |
| 1 | 2022-05-14 | 39223.0 | 301716.69 |
| 2 | 2023-07-01 | 35752.0 | 308037.89 |
| 3 | 2021-11-13 | 24727.0 | 162735.79 |
| 4 | 2021-09-25 | 33956.0 | 222055.26 |
| ... | ... | ... | ... |
| 143 | 2022-04-30 | 35650.0 | 275004.30 |
| 144 | 2022-10-01 | 35831.0 | 287711.35 |
| 145 | 2022-05-28 | 36645.0 | 274302.38 |
| 146 | 2021-12-11 | 26730.0 | 194410.59 |
| 147 | 2021-08-07 | 33884.0 | 222792.13 |
148 rows × 3 columns
We pull the data from Google BigQuery into this notebook; next we transform the 'results' DataFrame by converting the 'DATE' column to datetime and deriving year, month, and week features from it.
import pandas as pd
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()  # copy so feature columns can be added without a SettingWithCopyWarning
# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
|   | DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR |
|---|---|---|---|---|---|---|
| 0 | 2022-11-12 | 33680.0 | 269639.90 | 2022 | 11 | 45 |
| 1 | 2022-05-14 | 39223.0 | 301716.69 | 2022 | 5 | 19 |
| 2 | 2023-07-01 | 35752.0 | 308037.89 | 2023 | 7 | 26 |
| 3 | 2021-11-13 | 24727.0 | 162735.79 | 2021 | 11 | 45 |
| 4 | 2021-09-25 | 33956.0 | 222055.26 | 2021 | 9 | 38 |
| ... | ... | ... | ... | ... | ... | ... |
| 143 | 2022-04-30 | 35650.0 | 275004.30 | 2022 | 4 | 17 |
| 144 | 2022-10-01 | 35831.0 | 287711.35 | 2022 | 10 | 39 |
| 145 | 2022-05-28 | 36645.0 | 274302.38 | 2022 | 5 | 21 |
| 146 | 2021-12-11 | 26730.0 | 194410.59 | 2021 | 12 | 49 |
| 147 | 2021-08-07 | 33884.0 | 222792.13 | 2021 | 8 | 31 |
148 rows × 6 columns
We follow this same pattern throughout the notebook: import the filtered dataset from Google BigQuery and extract year, month, and week features from the 'DATE' column.
Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects. It works best with time series that have strong seasonal effects and several seasons of historical data. Prophet is robust to missing data and shifts in the trend, and typically handles outliers well.
import pandas as pd
from prophet import Prophet
import matplotlib.pyplot as plt
# Convert the 'DATE' column to datetime and sort by date
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)
# Prepare the DataFrame for Prophet's convention
df_prophet_unit = forecast_features[['DATE', 'UNIT_SALES']].rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
df_prophet_dollar = forecast_features[['DATE', 'DOLLAR_SALES']].rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Fit the Prophet model for unit sales
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(df_prophet_unit)
# Fit the Prophet model for dollar sales
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(df_prophet_dollar)
# Create a weekly future dataframe for one year and make predictions
# (the data is weekly, so a daily future frame would make a 13-row window mean 13 days)
future_unit = prophet_model_unit.make_future_dataframe(periods=52, freq='W')
future_dollar = prophet_model_dollar.make_future_dataframe(periods=52, freq='W')
forecast_unit = prophet_model_unit.predict(future_unit)
forecast_dollar = prophet_model_dollar.predict(future_dollar)
# Get the last historical date
last_historical_date = df_prophet_unit['ds'].max()
# Function to find the best 13 weeks within the forecast period
def find_best_13_weeks(forecast, last_historical_date):
    # Restrict to forecasted data after the last historical date
    forecast_future = forecast[forecast['ds'] > last_historical_date].copy()
    forecast_future['rolling_sum'] = forecast_future['yhat'].rolling(window=13, min_periods=1).sum()
    # idxmax labels the END of the best 13-row window; step back 12 rows for its start
    end_pos = forecast_future.index.get_loc(forecast_future['rolling_sum'].idxmax())
    start_pos = max(end_pos - 12, 0)
    best_period_start = forecast_future['ds'].iloc[start_pos]
    best_period_end = forecast_future['ds'].iloc[end_pos]
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales and dollar sales
best_start_unit, best_end_unit = find_best_13_weeks(forecast_unit, last_historical_date)
best_start_dollar, best_end_dollar = find_best_13_weeks(forecast_dollar, last_historical_date)
# Plotting function
def plot_prophet_forecast_with_highlights(model, forecast, best_start, best_end, title):
fig = model.plot(forecast)
plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
plt.title(title)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
# Plot the forecasts with the best 13 weeks highlighted
plot_prophet_forecast_with_highlights(prophet_model_unit, forecast_unit, best_start_unit, best_end_unit, 'Prophet Forecast for Unit Sales with Best 13 Weeks Highlighted')
plot_prophet_forecast_with_highlights(prophet_model_dollar, forecast_dollar, best_start_dollar, best_end_dollar, 'Prophet Forecast for Dollar Sales with Best 13 Weeks Highlighted')
From this plot we can see that the best 13 weeks for both unit sales and dollar sales run from July to October.
Let's evaluate the model's performance metrics.
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point].copy()
test = forecast_features.iloc[split_point:].copy()
# Resetting index to ensure 'DATE' is a regular column
train.reset_index(inplace=True)
test.reset_index(inplace=True)
train.rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'}, inplace=True)
# Fit the Prophet model for UNIT_SALES on the training set
prophet_model_unit = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_unit.fit(train[['ds', 'y']]) # Ensure 'ds' and 'y' columns are selected
# Generate forecasts on the actual test dates (the data is weekly, so predicting
# on a default daily future frame would not line up with the test set)
future_unit = test[['DATE']].rename(columns={'DATE': 'ds'})
forecast_unit = prophet_model_unit.predict(future_unit)
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
mse_unit = mean_squared_error(test['UNIT_SALES'], forecast_unit['yhat'][-len(test):])
# Repeat the process for DOLLAR_SALES (fit on the dollar series, not the unit-sales 'y')
train_dollar = train[['ds', 'DOLLAR_SALES']].rename(columns={'DOLLAR_SALES': 'y'})
prophet_model_dollar = Prophet(yearly_seasonality=True, weekly_seasonality=True)
prophet_model_dollar.fit(train_dollar)
future_dollar = test[['DATE']].rename(columns={'DATE': 'ds'})
forecast_dollar = prophet_model_dollar.predict(future_dollar)
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], forecast_dollar['yhat'][-len(test):])
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 4741.745698006632, MSE: 30453680.45680512
DOLLAR_SALES - MAE: 244077.61953977996, MSE: 60052964894.22412
The MAE and MSE for unit sales are about 4,741 and 30,453,680; for dollar sales they are about 244,077 and 60,052,964,894, respectively.
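Because MAE and MSE are scale-dependent, the unit-sales and dollar-sales errors above cannot be compared with each other directly; a scale-free metric such as MAPE would put both series on the same footing. A minimal sketch with invented numbers (not the notebook's data):

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

# Invented actuals vs. forecasts on two very different scales
unit_mape = mape([30000, 35000], [28000, 36000])
dollar_mape = mape([250000, 300000], [240000, 310000])
print(round(unit_mape, 2), round(dollar_mape, 2))  # percentage errors are directly comparable
```

MAPE breaks down when actuals are near zero, which is not a concern for these aggregated weekly sales.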
Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.
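The "weighted averages of past observations" in that description is easiest to see in the single-smoothing recursion s_t = alpha*y_t + (1 - alpha)*s_{t-1}; the Holt-Winters model used below adds trend and seasonal components on top of this same idea. A minimal sketch of the basic recursion:

```python
def simple_exponential_smoothing(series, alpha):
    """One-step smoothed levels: s_t = alpha * y_t + (1 - alpha) * s_{t-1}."""
    smoothed = [float(series[0])]  # initialize the level with the first observation
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

# With alpha = 0.5 each level is an equal blend of the new value and the old level
print(simple_exponential_smoothing([10.0, 20.0, 30.0], alpha=0.5))  # [10.0, 15.0, 22.5]
```

A larger alpha weights recent observations more heavily; statsmodels estimates this (and the trend/seasonal analogues) by optimization in `.fit()`.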
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Set the datetime 'DATE' column as the index and sort chronologically
forecast_features.set_index('DATE', inplace=True)
forecast_features.sort_index(inplace=True)
# Define the last date in the DataFrame
last_date = forecast_features.index.max()
# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_date, periods=53, freq='W')[1:]
# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
forecast_features['UNIT_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index
# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
forecast_features['DOLLAR_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index
# Function to find the best 13 weeks (a 91-day window over the weekly forecast)
def find_best_13_weeks(forecast):
    rolling_sum = forecast.rolling(window='91D').sum()
    best_period_end = rolling_sum.idxmax()  # labels the end of the best window
    best_period_start = best_period_end - pd.DateOffset(weeks=12)
    return best_period_start, best_period_end
# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar = find_best_13_weeks(exp_forecast_dollar)
# Plotting function with adjustment for negative values
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
plt.figure(figsize=(14, 7))
# Ensure no negative values in the forecast
forecast_positive = forecast.clip(lower=0)
plt.plot(forecast_positive.index, forecast_positive, label='Forecast')
plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 13 Weeks')
plt.title(title)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
# Plot the forecasts with the best thirteen weeks highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 13 Weeks Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 13 Weeks Highlighted')
From the plot we can see that the best 13 weeks for both unit sales and dollar sales run from July to October.
# Define the function to find the best 13 weeks
def find_best_13_weeks(forecast):
rolling_sum = forecast.rolling(window=13, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
best_period_start = best_period_end - pd.DateOffset(weeks=12)
return best_period_start, best_period_end, rolling_sum.max()
# Find the best 13 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_13_weeks(exp_forecast)
# Find the best 13 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_13_weeks(exp_forecast_dollar)
# Output the best periods and total sales
print(f"Best 13 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 13 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")
exp_forecast.index.freq = 'W-SUN'
exp_forecast_dollar.index.freq = 'W-SUN'
# Now, let's find the values and months for the best 13 weeks
best_13_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_13_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]
# Print out the results
print("Best 13 weeks for Unit Sales:")
print(best_13_weeks_values_unit)
print("\nBest 13 weeks for Dollar Sales:")
print(best_13_weeks_values_dollar)
# Extracting the month names for visualization
best_months_unit = best_13_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_13_weeks_values_dollar.index.month_name().unique()
print("\nBest months for Unit Sales within the 13-week period:")
print(best_months_unit)
print("\nBest months for Dollar Sales within the 13-week period:")
print(best_months_dollar)
Best 13 weeks for unit sales start on 2024-07-14 and end on 2024-10-06, with total sales: 464594.4540587605
Best 13 weeks for dollar sales start on 2024-07-14 and end on 2024-10-06, with total sales: 4258204.564067395

| Week (W-SUN) | Unit Sales | Dollar Sales |
|---|---|---|
| 2024-07-14 | 42443.154399 | 369900.668096 |
| 2024-07-21 | 37839.956020 | 340664.648405 |
| 2024-07-28 | 32907.331064 | 309588.930179 |
| 2024-08-04 | 33482.475829 | 313950.912343 |
| 2024-08-11 | 32264.235176 | 306419.112519 |
| 2024-08-18 | 34061.186527 | 316457.858622 |
| 2024-08-25 | 34373.593740 | 318355.324297 |
| 2024-09-01 | 34129.642549 | 317236.414892 |
| 2024-09-08 | 35262.415109 | 324651.213526 |
| 2024-09-15 | 33552.560289 | 312820.160369 |
| 2024-09-22 | 36232.678257 | 330256.981539 |
| 2024-09-29 | 39074.351117 | 349097.149445 |
| 2024-10-06 | 38970.873982 | 348805.189837 |

Best months within the 13-week period (both unit and dollar sales): July, August, September, October
Over these 13 weeks the forecast totals roughly 464,594 units and $4,258,204 in dollar sales.
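The window-selection logic above can be sanity-checked on a toy weekly series; the key detail is that `rolling(...).sum().idxmax()` labels the END of the best window, which is why the start is recovered by stepping back `window - 1` periods (values here are made up):

```python
import pandas as pd

# Toy weekly series with an obvious peak in the middle three weeks
idx = pd.date_range("2024-01-07", periods=8, freq="W")
sales = pd.Series([1, 1, 5, 5, 5, 1, 1, 1], index=idx, dtype=float)

window = 3
rolling_sum = sales.rolling(window=window, min_periods=1).sum()
best_end = rolling_sum.idxmax()                       # label of the window's last week
best_start = best_end - pd.DateOffset(weeks=window - 1)

print(best_start.date(), best_end.date())  # 2024-01-21 2024-02-04, the three weeks of 5s
```

Treating the `idxmax` label as the window's start instead would shift the reported period forward by almost the whole window length.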
Let's evaluate the model's performance metrics.
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]
# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}
# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()
# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)
# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 13323.255430846973, MSE: 214736828.1026801
DOLLAR_SALES - MAE: 87849.25315413318, MSE: 9191142635.488482
The MAE and MSE for unit sales are about 13,323 and 214,736,828; for dollar sales they are about 87,849 and 9,191,142,635, respectively.
From the exponential smoothing model, for non-Swire-CC manufacturers selling products with the package '.5L 12One Jug' and category 'Ing Enhanced Water', the best 13 weeks in the Southern region run from July to October, with about 464,594 unit sales and $4,258,204 in dollar sales, considerably higher than in the Northern region. The Southern region is therefore the stronger window for this product combination.
Item Description: Diet Energy Moonlit Casava 2L Multi Jug
Caloric Segment: Diet
Market Category: Energy
Manufacturer: Swire-CC
Brand: Diet Moonlit
Package Type: 2L Multi Jug
Flavor: 'Cassava'
Swire plans to release this product for 13 weeks, but only in one region.
Which region would it perform best in?
We first filtered on Category 'Energy', Manufacturer 'Swire-CC', and Caloric Segment 'Diet/Light'.
Before building the forecasting model, let's check the data and find similar products in the given dataset. Since the dataset is large, we use Google BigQuery to filter it down to the relevant products and import the filtered data into this notebook.
The dataset contains no product combination with Flavor 'Cassava', so we first consider the combination Caloric Segment 'Diet/Light', Market Category 'Energy', Manufacturer 'Swire-CC'.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
job = client.get_job('bquxjob_457a6c38_18e96f5f940') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES FROM `swirecc.fact_market_demand` WHERE CATEGORY = 'ENERGY' AND MANUFACTURER = 'SWIRE-CC' AND CALORIC_SEGMENT = 'DIET/LIGHT' GROUP BY DATE;
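The SQL above is a GROUP BY on DATE with two SUMs under a WHERE filter; on a small in-memory sample the same aggregation can be expressed as a pandas groupby (the rows below are invented for illustration, not taken from `swirecc.fact_market_demand`):

```python
import pandas as pd

# Invented sample standing in for the fact_market_demand table
raw = pd.DataFrame({
    "DATE": ["2021-01-09", "2021-01-09", "2021-10-02"],
    "CATEGORY": ["ENERGY", "ENERGY", "SSD"],
    "MANUFACTURER": ["SWIRE-CC", "SWIRE-CC", "SWIRE-CC"],
    "CALORIC_SEGMENT": ["DIET/LIGHT", "DIET/LIGHT", "REGULAR"],
    "UNIT_SALES": [3000.0, 1593.0, 500.0],
    "DOLLAR_SALES": [2800.00, 1503.21, 450.00],
})

# WHERE clause as a boolean mask
mask = (
    (raw["CATEGORY"] == "ENERGY")
    & (raw["MANUFACTURER"] == "SWIRE-CC")
    & (raw["CALORIC_SEGMENT"] == "DIET/LIGHT")
)

# GROUP BY DATE with SUM(UNIT_SALES), SUM(DOLLAR_SALES)
results = (
    raw[mask]
    .groupby("DATE", as_index=False)[["UNIT_SALES", "DOLLAR_SALES"]]
    .sum()
)
print(results)  # one aggregated row per date that passes the filter
```

Pushing this aggregation into BigQuery, as the notebook does, avoids moving the full table over the network.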
job = client.get_job('bquxjob_457a6c38_18e96f5f940') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| | DATE | UNIT_SALES | DOLLAR_SALES |
|---|---|---|---|
| 0 | 2021-01-09 | 4593.0 | 4303.21 |
| 1 | 2021-10-02 | 3856.0 | 3522.99 |
| 2 | 2021-08-28 | 4322.0 | 4004.36 |
| 3 | 2023-07-15 | 2036.0 | 2123.17 |
| 4 | 2021-08-14 | 4615.0 | 4211.04 |
| ... | ... | ... | ... |
| 134 | 2023-04-22 | 1925.0 | 2090.46 |
| 135 | 2023-07-22 | 2009.0 | 2135.09 |
| 136 | 2022-05-14 | 2841.0 | 3044.46 |
| 137 | 2022-02-05 | 3201.0 | 2870.83 |
| 138 | 2021-11-13 | 3704.0 | 3346.70 |
139 rows × 3 columns
We import the data from Google BigQuery into this notebook, convert the 'DATE' column of the 'results' dataframe to datetime, and derive year, month, and week features from it.
import pandas as pd
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extract relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]
# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| | DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR |
|---|---|---|---|---|---|---|
| 0 | 2021-01-09 | 4593.0 | 4303.21 | 2021 | 1 | 1 |
| 1 | 2021-10-02 | 3856.0 | 3522.99 | 2021 | 10 | 39 |
| 2 | 2021-08-28 | 4322.0 | 4004.36 | 2021 | 8 | 34 |
| 3 | 2023-07-15 | 2036.0 | 2123.17 | 2023 | 7 | 28 |
| 4 | 2021-08-14 | 4615.0 | 4211.04 | 2021 | 8 | 32 |
| ... | ... | ... | ... | ... | ... | ... |
| 134 | 2023-04-22 | 1925.0 | 2090.46 | 2023 | 4 | 16 |
| 135 | 2023-07-22 | 2009.0 | 2135.09 | 2023 | 7 | 29 |
| 136 | 2022-05-14 | 2841.0 | 3044.46 | 2022 | 5 | 19 |
| 137 | 2022-02-05 | 3201.0 | 2870.83 | 2022 | 2 | 5 |
| 138 | 2021-11-13 | 3704.0 | 3346.70 | 2021 | 11 | 45 |
139 rows × 6 columns
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Sort the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)
# Define the last date in the DataFrame for historical data
last_historical_date = forecast_features.index.max()
# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_historical_date, periods=53, freq='W')[1:]
# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
forecast_features['UNIT_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index
# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
forecast_features['DOLLAR_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index
# Function to find the best 6 months (approximately 26 weeks)
def find_best_26_weeks(forecast):
rolling_sum = forecast.rolling(window=26, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
best_period_start = best_period_end - pd.DateOffset(weeks=25) # 26 weeks include the end week
return best_period_start, best_period_end
# Find the best 6 months for unit sales
best_start_unit, best_end_unit = find_best_26_weeks(exp_forecast)
# Find the best 6 months for dollar sales
best_start_dollar, best_end_dollar = find_best_26_weeks(exp_forecast_dollar)
# Plotting function with the best 6 months highlighted
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
plt.figure(figsize=(14, 7))
plt.plot(forecast.index, forecast, label='Forecast')
plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 6 Months')
plt.title(title)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
# Plot the forecasts with the best 6 months highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 6 Months Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 6 Months Highlighted')
From the plot we can see that the best 26 weeks for both unit sales and dollar sales run from November to April.
# Define the function to find the best 26 weeks
def find_best_26_weeks(forecast):
rolling_sum = forecast.rolling(window=26, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
best_period_start = best_period_end - pd.DateOffset(weeks=25)
return best_period_start, best_period_end, rolling_sum.max()
# Find the best 26 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_26_weeks(exp_forecast)
# Find the best 26 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_26_weeks(exp_forecast_dollar)
# Output the best periods and total sales
print(f"Best 26 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 26 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")
# Now, let's find the values for the best 26 weeks
best_26_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_26_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]
# Print out the results
print("Best 26 weeks for Unit Sales:")
print(best_26_weeks_values_unit)
print("\nBest 26 weeks for Dollar Sales:")
print(best_26_weeks_values_dollar)
# Extracting the month names for visualization
best_months_unit = best_26_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_26_weeks_values_dollar.index.month_name().unique()
print("\nBest months for Unit Sales within the 26-week period:")
print(best_months_unit)
print("\nBest months for Dollar Sales within the 26-week period:")
print(best_months_dollar)
Best 26 weeks for unit sales start on 2023-11-05 and end on 2024-04-28, with total sales: 40468.83476378765
Best 26 weeks for dollar sales start on 2023-11-05 and end on 2024-04-28, with total sales: 46427.21591674558

| Week (W-SUN) | Unit Sales | Dollar Sales |
|---|---|---|
| 2023-11-05 | 2171.170239 | 2304.329268 |
| 2023-11-12 | 2321.668503 | 2380.930372 |
| 2023-11-19 | 2193.372092 | 2293.462332 |
| 2023-11-26 | 2047.140098 | 2163.956281 |
| 2023-12-03 | 2296.533245 | 2447.167730 |
| 2023-12-10 | 2369.246592 | 2493.962990 |
| 2023-12-17 | 2049.663623 | 2217.521035 |
| 2023-12-24 | 1974.224296 | 2158.136001 |
| 2023-12-31 | 1790.268407 | 1956.502936 |
| 2024-01-07 | 2193.979804 | 2419.773739 |
| 2024-01-14 | 1956.610374 | 2115.183791 |
| 2024-01-21 | 1502.508981 | 1631.926909 |
| 2024-01-28 | 1321.018341 | 1491.065328 |
| 2024-02-04 | 1291.585614 | 1537.274507 |
| 2024-02-11 | 1362.249492 | 1628.122270 |
| 2024-02-18 | 1238.811750 | 1524.160686 |
| 2024-02-25 | 962.418576 | 1252.504368 |
| 2024-03-03 | 1076.092944 | 1376.407396 |
| 2024-03-10 | 1343.447003 | 1656.640720 |
| 2024-03-17 | 1220.509414 | 1487.357865 |
| 2024-03-24 | 1102.451545 | 1508.118727 |
| 2024-03-31 | 1048.203694 | 1443.083223 |
| 2024-04-07 | 991.819266 | 1372.871905 |
| 2024-04-14 | 1138.972105 | 1461.203071 |
| 2024-04-21 | 797.561897 | 1104.389924 |
| 2024-04-28 | 707.306868 | 1001.162543 |

Best months within the 26-week period (both unit and dollar sales): November, December, January, February, March, April
Over these 26 weeks the forecast totals roughly 40,468 units and $46,427 in dollar sales.
Let's evaluate the model's performance metrics.
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]
# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}
# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()
# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)
# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 239.80472298824847, MSE: 88183.82539630556
DOLLAR_SALES - MAE: 417.2807652676282, MSE: 262579.3380016585
The MAE and MSE for unit sales are about 240 and 88,184; for dollar sales they are about 417 and 262,579, respectively.
From the model, the best six months of sales run from November to April, with total unit sales of about 40,468 and dollar sales of about $46,427.
Since no Swire-CC product in this combination carries the target flavor, we use non-Swire-CC data to model the flavor.
We next filtered on Flavor 'Casava', non-Swire-CC manufacturers, and Caloric Segment 'Diet/Light'.
As before, we use Google BigQuery to filter the large dataset down to similar products and import the filtered data into this notebook.
The dataset has no exact match for the full product combination, so we consider Caloric Segment 'Diet/Light', non-Swire-CC manufacturers, and Flavor 'Casava'.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
job = client.get_job('bquxjob_64491685_18e97366214') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES FROM `swirecc.fact_market_demand` WHERE ITEM LIKE '%CASAVA%' AND MANUFACTURER != 'SWIRE-CC' AND CALORIC_SEGMENT = 'DIET/LIGHT' GROUP BY DATE;
job = client.get_job('bquxjob_64491685_18e97366214') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| | DATE | UNIT_SALES | DOLLAR_SALES |
|---|---|---|---|
| 0 | 2021-04-03 | 6610.0 | 24789.77 |
| 1 | 2021-07-17 | 24918.0 | 87976.93 |
| 2 | 2023-05-06 | 43457.0 | 177678.72 |
| 3 | 2021-04-24 | 6645.0 | 24910.45 |
| 4 | 2021-03-13 | 6783.0 | 25277.23 |
| ... | ... | ... | ... |
| 143 | 2022-06-25 | 50474.0 | 180745.74 |
| 144 | 2022-06-11 | 46432.0 | 166148.73 |
| 145 | 2023-03-11 | 43351.0 | 176258.03 |
| 146 | 2021-10-23 | 13933.0 | 50896.24 |
| 147 | 2023-02-11 | 45757.0 | 175155.80 |
148 rows × 3 columns
Again we import the filtered data from Google BigQuery, convert the 'DATE' column of 'results' to datetime, and derive year, month, and week features.
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extract relevant features for forecasting (copy to avoid pandas' SettingWithCopyWarning)
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| | DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR |
|---|---|---|---|---|---|---|
| 0 | 2021-04-03 | 6610.0 | 24789.77 | 2021 | 4 | 13 |
| 1 | 2021-07-17 | 24918.0 | 87976.93 | 2021 | 7 | 28 |
| 2 | 2023-05-06 | 43457.0 | 177678.72 | 2023 | 5 | 18 |
| 3 | 2021-04-24 | 6645.0 | 24910.45 | 2021 | 4 | 16 |
| 4 | 2021-03-13 | 6783.0 | 25277.23 | 2021 | 3 | 10 |
| ... | ... | ... | ... | ... | ... | ... |
| 143 | 2022-06-25 | 50474.0 | 180745.74 | 2022 | 6 | 25 |
| 144 | 2022-06-11 | 46432.0 | 166148.73 | 2022 | 6 | 23 |
| 145 | 2023-03-11 | 43351.0 | 176258.03 | 2023 | 3 | 10 |
| 146 | 2021-10-23 | 13933.0 | 50896.24 | 2021 | 10 | 42 |
| 147 | 2023-02-11 | 45757.0 | 175155.80 | 2023 | 2 | 6 |
148 rows × 6 columns
We follow this same pattern, importing the filtered dataset from Google BigQuery and extracting year, month and week from the 'DATE' column, throughout the notebook.
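Since this date-feature preparation repeats for every product combination below, it can be factored into a small helper (a sketch; the name `add_time_features` is our own, not part of the original notebook):

```python
import pandas as pd

def add_time_features(df, date_col="DATE"):
    """Return a copy of df with the date column parsed to datetime and
    YEAR / MONTH / WEEK_OF_YEAR columns derived from it."""
    out = df.copy()
    out[date_col] = pd.to_datetime(out[date_col])
    out["YEAR"] = out[date_col].dt.year
    out["MONTH"] = out[date_col].dt.month
    out["WEEK_OF_YEAR"] = out[date_col].dt.isocalendar().week
    return out

demo = pd.DataFrame({"DATE": ["2021-04-03"], "UNIT_SALES": [6610.0]})
print(add_time_features(demo))
```

Each section could then call `add_time_features(results)` instead of repeating the four assignment lines.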
Exponential smoothing is a forecasting method that uses weighted averages of past observations to predict new values. It is most effective when the values of the time series follow a gradual trend and display seasonal behavior in which the values follow a repeated cyclical pattern over a given number of time steps.
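The weighted-average idea can be seen in the basic smoothing recursion s_t = α·y_t + (1 − α)·s_{t−1}, where a larger α gives more weight to recent observations. A toy sketch (independent of the Holt-Winters model fitted below, which adds trend and seasonal terms):

```python
def simple_exp_smooth(series, alpha=0.5):
    """One pass of simple exponential smoothing; returns the final
    smoothed level, which serves as the one-step-ahead forecast."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level  # recent values weighted by alpha
    return level

print(simple_exp_smooth([10, 12, 14, 16]))  # 14.25
```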
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Sort the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)
# Define the last date in the DataFrame for historical data
last_historical_date = forecast_features.index.max()
# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_historical_date, periods=53, freq='W')[1:]
# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
forecast_features['UNIT_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index
# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
forecast_features['DOLLAR_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index
# Function to find the best 6 months (approximately 26 weeks)
def find_best_26_weeks(forecast):
rolling_sum = forecast.rolling(window=26, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
best_period_start = best_period_end - pd.DateOffset(weeks=25) # 26 weeks include the end week
return best_period_start, best_period_end
# Find the best 6 months for unit sales
best_start_unit, best_end_unit = find_best_26_weeks(exp_forecast)
# Find the best 6 months for dollar sales
best_start_dollar, best_end_dollar = find_best_26_weeks(exp_forecast_dollar)
# Plotting function with the best 6 months highlighted
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
plt.figure(figsize=(14, 7))
plt.plot(forecast.index, forecast, label='Forecast')
plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 6 Months')
plt.title(title)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
# Plot the forecasts with the best 6 months highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 6 Months Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 6 Months Highlighted')
(statsmodels warnings, repeated for each fit: ValueWarning, a date index was provided but has no associated frequency and is ignored when forecasting, so predictions get an integer index; FutureWarning, forecasting without a supported index will raise in a future version; ConvergenceWarning, "Optimization failed to converge. Check mle_retvals.")
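The ValueWarnings above appear because the weekly index carries no explicit frequency, so statsmodels falls back to an integer index for predictions (which is why the forecast index is reattached manually afterwards). One way to avoid this, assuming the observations really are weekly and Saturday-dated as in this data, is to attach the frequency before fitting:

```python
import pandas as pd

# Dates parsed from BigQuery results carry no frequency attribute
idx = pd.DatetimeIndex(["2021-01-02", "2021-01-09", "2021-01-16"])
s = pd.Series([1.0, 2.0, 3.0], index=idx)
print(s.index.freq)  # None, so statsmodels warns and ignores the dates

# asfreq attaches an explicit weekly (Saturday) frequency;
# forecasts from a model fit on this series then keep date labels
s = s.asfreq("W-SAT")
print(s.index.freqstr)  # W-SAT
```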
From the plots we can see that the best 26 weeks for both unit sales and dollar sales run from April to October.
# Define the function to find the best 26 weeks
def find_best_26_weeks(forecast):
rolling_sum = forecast.rolling(window=26, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
best_period_start = best_period_end - pd.DateOffset(weeks=25)
return best_period_start, best_period_end, rolling_sum.max()
# Find the best 26 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_26_weeks(exp_forecast)
# Find the best 26 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_26_weeks(exp_forecast_dollar)
# Output the best periods and total sales
print(f"Best 26 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 26 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")
# Now, let's find the values for the best 26 weeks
best_26_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_26_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]
# Print out the results
print("Best 26 weeks for Unit Sales:")
print(best_26_weeks_values_unit)
print("\nBest 26 weeks for Dollar Sales:")
print(best_26_weeks_values_dollar)
# Extracting the month names for visualization
best_months_unit = best_26_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_26_weeks_values_dollar.index.month_name().unique()
print("\nBest months for Unit Sales within the 26-week period:")
print(best_months_unit)
print("\nBest months for Dollar Sales within the 26-week period:")
print(best_months_dollar)
Best 26 weeks for unit sales start on 2024-04-21 and end on 2024-10-13, with total sales: 1227793.7380098142
Best 26 weeks for dollar sales start on 2024-04-21 and end on 2024-10-13, with total sales: 4914402.822625622
Best 26 weeks for Unit Sales:
2024-04-21    40418.033021
2024-04-28    42740.520955
2024-05-05    44012.896388
2024-05-12    44181.204299
2024-05-19    44054.482594
2024-05-26    44764.104845
2024-06-02    46214.687403
2024-06-09    50902.053672
2024-06-16    53630.811032
2024-06-23    54461.491084
2024-06-30    46443.993389
2024-07-07    40978.131545
2024-07-14    40902.448850
2024-07-21    41627.635865
2024-07-28    47028.436022
2024-08-04    48826.511522
2024-08-11    51455.305162
2024-08-18    54435.810559
2024-08-25    53226.901432
2024-09-01    52638.750902
2024-09-08    53115.994937
2024-09-15    49865.496388
2024-09-22    47035.393628
2024-09-29    47466.708458
2024-10-06    44586.997127
2024-10-13    42778.936930
Freq: W-SUN, dtype: float64
Best 26 weeks for Dollar Sales:
2024-04-21    171111.403175
2024-04-28    180729.921899
2024-05-05    186967.150778
2024-05-12    186669.077060
2024-05-19    184906.952382
2024-05-26    186274.268576
2024-06-02    189545.580864
2024-06-09    202315.589330
2024-06-16    207866.724115
2024-06-23    209819.995727
2024-06-30    178516.050156
2024-07-07    164623.789983
2024-07-14    164515.663724
2024-07-21    169518.801625
2024-07-28    187799.778810
2024-08-04    193557.738142
2024-08-11    202432.954340
2024-08-18    212103.515913
2024-08-25    206851.792013
2024-09-01    205743.429550
2024-09-08    206073.407932
2024-09-15    195979.133752
2024-09-22    186706.393930
2024-09-29    185569.311947
2024-10-06    176809.675542
2024-10-13    171394.721363
Freq: W-SUN, dtype: float64
Best months for Unit Sales within the 26-week period:
Index(['April', 'May', 'June', 'July', 'August', 'September', 'October'], dtype='object')
Best months for Dollar Sales within the 26-week period:
Index(['April', 'May', 'June', 'July', 'August', 'September', 'October'], dtype='object')
The total unit sales of these products over these 26 weeks are approximately 1,227,793, and the dollar sales are approximately $4,914,402.
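One caveat on `find_best_26_weeks`: with `min_periods=1`, the rolling sums at the start of the forecast cover fewer than 26 weeks, so if one of them happened to be the maximum, the reported start date (`best_period_end` minus 25 weeks) would fall before the first forecast week. A toy illustration of the rolling-sum-then-idxmax pattern (window of 3 and invented values for brevity):

```python
import pandas as pd

idx = pd.date_range("2024-01-07", periods=6, freq="W")
forecast = pd.Series([1, 2, 9, 9, 1, 1], index=idx, dtype=float)

rolling_sum = forecast.rolling(window=3, min_periods=1).sum()
best_end = rolling_sum.idxmax()            # window [2, 9, 9] ends 2024-01-28
best_start = best_end - pd.DateOffset(weeks=2)
print(best_start.date(), best_end.date(), rolling_sum.max())
```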
Let's evaluate the model's performance metrics.
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]
# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}
# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()
# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)
# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 10314.963439861342, MSE: 133439789.47343811
DOLLAR_SALES - MAE: 38728.14920459904, MSE: 1828464035.8834724
(statsmodels warnings, repeated for each fit: ValueWarning, no frequency information was provided so the inferred frequency W-SAT will be used; ConvergenceWarning, "Optimization failed to converge. Check mle_retvals.")
The MAE and MSE values for unit sales are 10,314 and 133,439,789; for dollar sales the respective values are 38,728 and 1,828,464,035.
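MAE and MSE are scale-dependent, so these large values mainly reflect the size of the series (weekly unit sales in the tens of thousands). A scale-free companion metric such as MAPE makes the two targets easier to compare (a sketch; this `mape` helper and its numbers are our own additions, not part of the original evaluation):

```python
import numpy as np

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (assumes no zero actuals)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs((actual - forecast) / actual)) * 100)

# e.g. a forecast that is off by 10% in both weeks
print(mape([100, 100], [110, 90]))
```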
From the model for the non-Swire-CC, Cassava-flavor combination, we can say the best 6 months of sales run from April to October, with unit sales of about 1,227,793 and dollar sales of about $4,914,402.
In the next model we use a package combination to explore sales for the '2L Multi Jug' package.
We first filter on package '2L Multi Jug', manufacturer 'Swire-CC', caloric segment 'Diet/Light' and brand 'Diet Moonlit'.
As before, we check the data for similar products by filtering in Google BigQuery and importing the result into this notebook.
We first consider these specifications: caloric segment Diet/Light, manufacturer Swire-CC, package 2L Multi Jug, brand Diet Moonlit.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
job = client.get_job('bquxjob_7af2f6fe_18e973a652e') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES FROM `swirecc.fact_market_demand` WHERE PACKAGE = '2L MULTI JUG' AND MANUFACTURER = 'SWIRE-CC' AND CALORIC_SEGMENT = 'DIET/LIGHT' AND BRAND = 'DIET MOONLIT' GROUP BY DATE;
job = client.get_job('bquxjob_7af2f6fe_18e973a652e') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| | DATE | UNIT_SALES | DOLLAR_SALES |
|---|---|---|---|
| 0 | 2021-02-13 | 20212.0 | 30147.19 |
| 1 | 2022-08-20 | 16085.0 | 25081.16 |
| 2 | 2021-09-11 | 14129.0 | 21345.83 |
| 3 | 2022-04-23 | 18314.0 | 28393.05 |
| 4 | 2022-01-08 | 14753.0 | 21325.79 |
| ... | ... | ... | ... |
| 142 | 2021-08-28 | 16980.0 | 22703.26 |
| 143 | 2021-01-09 | 20439.0 | 26439.57 |
| 144 | 2021-10-02 | 16770.0 | 25637.67 |
| 145 | 2023-08-05 | 20631.0 | 33975.46 |
| 146 | 2022-09-10 | 17558.0 | 30238.48 |
147 rows × 3 columns
We pull the data from Google BigQuery into this notebook; next we modify the dataframe 'results' by converting the 'DATE' column to datetime and deriving year, month and week features from it.
# Convert 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extract relevant features for forecasting (copy to avoid pandas' SettingWithCopyWarning)
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
# Add additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Display the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| | DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR |
|---|---|---|---|---|---|---|
| 0 | 2021-02-13 | 20212.0 | 30147.19 | 2021 | 2 | 6 |
| 1 | 2022-08-20 | 16085.0 | 25081.16 | 2022 | 8 | 33 |
| 2 | 2021-09-11 | 14129.0 | 21345.83 | 2021 | 9 | 36 |
| 3 | 2022-04-23 | 18314.0 | 28393.05 | 2022 | 4 | 16 |
| 4 | 2022-01-08 | 14753.0 | 21325.79 | 2022 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... |
| 142 | 2021-08-28 | 16980.0 | 22703.26 | 2021 | 8 | 34 |
| 143 | 2021-01-09 | 20439.0 | 26439.57 | 2021 | 1 | 1 |
| 144 | 2021-10-02 | 16770.0 | 25637.67 | 2021 | 10 | 39 |
| 145 | 2023-08-05 | 20631.0 | 33975.46 | 2023 | 8 | 31 |
| 146 | 2022-09-10 | 17558.0 | 30238.48 | 2022 | 9 | 36 |
147 rows × 6 columns
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Ensure the DATE column is in datetime format and set as the DataFrame's index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Sort the DataFrame by the datetime index
forecast_features.sort_index(inplace=True)
# Define the last date in the DataFrame for historical data
last_historical_date = forecast_features.index.max()
# Prepare the forecast index for the next year after the last date
forecast_index = pd.date_range(start=last_historical_date, periods=53, freq='W')[1:]
# Exponential Smoothing Forecast for UNIT_SALES
exp_model = ExponentialSmoothing(
forecast_features['UNIT_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast = exp_model.forecast(52)
exp_forecast.index = forecast_index
# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
forecast_features['DOLLAR_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
exp_forecast_dollar = exp_model_dollar.forecast(52)
exp_forecast_dollar.index = forecast_index
# Function to find the best 6 months (approximately 26 weeks)
def find_best_26_weeks(forecast):
rolling_sum = forecast.rolling(window=26, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
best_period_start = best_period_end - pd.DateOffset(weeks=25) # 26 weeks include the end week
return best_period_start, best_period_end
# Find the best 6 months for unit sales
best_start_unit, best_end_unit = find_best_26_weeks(exp_forecast)
# Find the best 6 months for dollar sales
best_start_dollar, best_end_dollar = find_best_26_weeks(exp_forecast_dollar)
# Plotting function with the best 6 months highlighted
def plot_forecast_with_highlights(forecast, best_start, best_end, title):
plt.figure(figsize=(14, 7))
plt.plot(forecast.index, forecast, label='Forecast')
plt.axvspan(best_start, best_end, color='orange', alpha=0.3, label='Best 6 Months')
plt.title(title)
plt.xlabel('Date')
plt.ylabel('Sales')
plt.legend()
plt.show()
# Plot the forecasts with the best 6 months highlighted
plot_forecast_with_highlights(exp_forecast, best_start_unit, best_end_unit, 'Unit Sales Forecast with Best 6 Months Highlighted')
plot_forecast_with_highlights(exp_forecast_dollar, best_start_dollar, best_end_dollar, 'Dollar Sales Forecast with Best 6 Months Highlighted')
(statsmodels warnings, repeated for each fit: ValueWarning, a date index was provided but has no associated frequency and is ignored when forecasting, so predictions get an integer index; FutureWarning, forecasting without a supported index will raise in a future version; ConvergenceWarning, "Optimization failed to converge. Check mle_retvals.")
From the plots we can see that the best 26 weeks for unit sales run from November to May, and for dollar sales from December to June.
# Define the function to find the best 26 weeks
def find_best_26_weeks(forecast):
rolling_sum = forecast.rolling(window=26, min_periods=1).sum()
best_period_end = rolling_sum.idxmax()
best_period_start = best_period_end - pd.DateOffset(weeks=25)
return best_period_start, best_period_end, rolling_sum.max()
# Find the best 26 weeks for unit sales
best_start_unit, best_end_unit, max_sales_unit = find_best_26_weeks(exp_forecast)
# Find the best 26 weeks for dollar sales
best_start_dollar, best_end_dollar, max_sales_dollar = find_best_26_weeks(exp_forecast_dollar)
# Output the best periods and total sales
print(f"Best 26 weeks for unit sales start on {best_start_unit.date()} and end on {best_end_unit.date()}, with total sales: {max_sales_unit}")
print(f"Best 26 weeks for dollar sales start on {best_start_dollar.date()} and end on {best_end_dollar.date()}, with total sales: {max_sales_dollar}")
# Now, let's find the values for the best 26 weeks
best_26_weeks_values_unit = exp_forecast.loc[best_start_unit:best_end_unit]
best_26_weeks_values_dollar = exp_forecast_dollar.loc[best_start_dollar:best_end_dollar]
# Print out the results
print("Best 26 weeks for Unit Sales:")
print(best_26_weeks_values_unit)
print("\nBest 26 weeks for Dollar Sales:")
print(best_26_weeks_values_dollar)
# Extracting the month names for visualization
best_months_unit = best_26_weeks_values_unit.index.month_name().unique()
best_months_dollar = best_26_weeks_values_dollar.index.month_name().unique()
print("\nBest months for Unit Sales within the 26-week period:")
print(best_months_unit)
print("\nBest months for Dollar Sales within the 26-week period:")
print(best_months_dollar)
Best 26 weeks for unit sales start on 2023-11-26 and end on 2024-05-19, with total sales: 427813.3789549731
Best 26 weeks for dollar sales start on 2023-12-17 and end on 2024-06-09, with total sales: 838114.9938714138
Best 26 weeks for Unit Sales:
2023-11-26    18273.557368
2023-12-03    17073.675711
2023-12-10    16790.699130
2023-12-17    17610.647375
2023-12-24    17945.467759
2023-12-31    19151.167046
2024-01-07    17359.638204
2024-01-14    17887.378321
2024-01-21    17425.926358
2024-01-28    18775.194943
2024-02-04    17264.305144
2024-02-11    15921.942614
2024-02-18    13670.293916
2024-02-25    13384.690415
2024-03-03    13376.592792
2024-03-10    13738.471327
2024-03-17    14194.946141
2024-03-24    14273.816455
2024-03-31    15187.829951
2024-04-07    17385.142863
2024-04-14    16268.308741
2024-04-21    16654.632591
2024-04-28    16482.525034
2024-05-05    17690.652919
2024-05-12    16515.122418
2024-05-19    17510.753420
Freq: W-SUN, dtype: float64
Best 26 weeks for Dollar Sales:
2023-12-17    32796.052595
2023-12-24    32715.376497
2023-12-31    34628.901927
2024-01-07    31799.171033
2024-01-14    32266.722821
2024-01-21    31948.042390
2024-01-28    33843.241264
2024-02-04    31938.606478
2024-02-11    30206.685693
2024-02-18    28399.765983
2024-02-25    27729.054990
2024-03-03    27690.232296
2024-03-10    28292.725203
2024-03-17    29027.944043
2024-03-24    29664.709153
2024-03-31    30794.732258
2024-04-07    35226.660164
2024-04-14    34599.946032
2024-04-21    33930.319733
2024-04-28    34028.651332
2024-05-05    35303.518607
2024-05-12    34312.141535
2024-05-19    34744.548158
2024-05-26    34202.556449
2024-06-02    34796.812176
2024-06-09    33227.875063
Freq: W-SUN, dtype: float64
Best months for Unit Sales within the 26-week period:
Index(['November', 'December', 'January', 'February', 'March', 'April', 'May'], dtype='object')
Best months for Dollar Sales within the 26-week period:
Index(['December', 'January', 'February', 'March', 'April', 'May', 'June'], dtype='object')
The total unit sales of these products over these 26 weeks are approximately 427,813, and the dollar sales are approximately $838,114.
Let's evaluate the model's performance metrics.
from sklearn.metrics import mean_squared_error, mean_absolute_error
# Split the data into train and test sets
split_point = int(len(forecast_features) * 0.8) # for an 80/20 split
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]
# Define the parameters for the Exponential Smoothing model
params = {'trend': 'add', 'seasonal': 'add', 'seasonal_periods': 52}
# Fit the model on the training set for UNIT_SALES
exp_model_unit = ExponentialSmoothing(train['UNIT_SALES'], **params).fit()
# Generate forecasts for the test set period
unit_sales_forecast = exp_model_unit.forecast(len(test))
# Calculate MAE and MSE for UNIT_SALES
mae_unit = mean_absolute_error(test['UNIT_SALES'], unit_sales_forecast)
mse_unit = mean_squared_error(test['UNIT_SALES'], unit_sales_forecast)
# Repeat the process for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(train['DOLLAR_SALES'], **params).fit()
dollar_sales_forecast = exp_model_dollar.forecast(len(test))
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)
# Print the evaluation metrics
print(f'UNIT_SALES - MAE: {mae_unit}, MSE: {mse_unit}')
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
UNIT_SALES - MAE: 1926.6166084072836, MSE: 5642614.361837015
DOLLAR_SALES - MAE: 2929.877972609832, MSE: 10958959.979422066
(statsmodels warnings, repeated for each fit: ValueWarning, no frequency information was provided so the inferred frequency W-SAT will be used; ConvergenceWarning, "Optimization failed to converge. Check mle_retvals.")
The MAE and MSE values for unit sales are 1,926 and 5,642,614; for dollar sales the respective values are 2,929 and 10,958,959.
From the '2L Multi Jug' package model, we can say the best 26 weeks for unit sales (about 427,813 units) run from November to May, and the best 26 weeks for dollar sales (about $838,114) run from December to June.
Item Description: Diet Square Mulberries Sparkling Water 10Small MLT
Caloric Segment: Diet
Market Category: Sparkling Water
Manufacturer: Swire-CC
Brand: Square
Package Type: 10Small MLT
Flavor: 'Mulberries'
Swire plans to release this product for the duration of 1 year but only in the Northern region.
What will the forecasted demand be, in weeks, for this product?
We first filter on caloric segment 'Diet/Light', category 'Sparkling Water', brand 'Square' and manufacturer 'Swire-CC'.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
job = client.get_job('bquxjob_18d54420_18e977a5cdb') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT fmd.DATE,SUM(fmd.UNIT_SALES) AS UNIT_SALES, SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
SELECT DISTINCT zm.MARKET_KEY
FROM `swirecc.zip_to_market_unit_mapping` zm
LEFT JOIN `swirecc.consumer_demographics` cd
ON cd.Zip = zm.ZIP_CODE
WHERE cd.State NOT IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CALORIC_SEGMENT = 'DIET/LIGHT'
AND fmd.CATEGORY = 'SPARKLING WATER'
AND fmd.MANUFACTURER = 'SWIRE-CC'
AND fmd.BRAND = 'SQUARE'
GROUP BY DATE;
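The join above defines the Northern region by excluding market keys whose zip codes map to the listed southern states. The same exclusion logic can be sketched in pandas (the mini-tables below are invented; column names mirror the query):

```python
import pandas as pd

southern_states = ['KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV']

# Invented stand-ins for zip_to_market_unit_mapping and consumer_demographics
zip_to_market = pd.DataFrame({'ZIP_CODE': [111, 222, 333],
                              'MARKET_KEY': ['A', 'B', 'C']})
demographics = pd.DataFrame({'Zip': [111, 222, 333],
                             'State': ['WA', 'CA', 'OR']})

merged = zip_to_market.merge(demographics, left_on='ZIP_CODE', right_on='Zip')
northern_keys = merged.loc[~merged['State'].isin(southern_states), 'MARKET_KEY'].unique()
print(northern_keys)  # ['A' 'C']
```

One subtlety: SQL's `NOT IN` silently drops rows whose `State` is NULL, whereas `~isin` keeps rows with a missing state, so the two can disagree on incomplete demographics.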
job = client.get_job('bquxjob_18d54420_18e977a5cdb') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| | DATE | UNIT_SALES | DOLLAR_SALES |
|---|---|---|---|
| 0 | 2023-06-10 | 4.0 | 13.56 |
| 1 | 2023-04-08 | 3.0 | 9.37 |
| 2 | 2023-02-25 | 4.0 | 10.96 |
| 3 | 2023-09-30 | 46.0 | 73.95 |
| 4 | 2023-07-15 | 2.0 | 6.78 |
| 5 | 2023-09-16 | 48.0 | 69.57 |
| 6 | 2023-07-29 | 3.0 | 10.17 |
| 7 | 2023-10-07 | 36.0 | 58.14 |
| 8 | 2023-10-14 | 94.0 | 131.73 |
| 9 | 2023-05-27 | 4.0 | 12.36 |
| 10 | 2023-03-25 | 6.0 | 18.74 |
| 11 | 2023-09-02 | 7.0 | 21.33 |
| 12 | 2023-05-06 | 9.0 | 28.51 |
| 13 | 2023-01-28 | 4.0 | 11.96 |
| 14 | 2023-04-22 | 7.0 | 22.53 |
| 15 | 2023-07-22 | 3.0 | 10.17 |
| 16 | 2023-07-01 | 1.0 | 3.39 |
| 17 | 2023-06-24 | 3.0 | 10.17 |
| 18 | 2023-04-01 | 6.0 | 20.34 |
| 19 | 2023-05-13 | 7.0 | 22.53 |
| 20 | 2023-06-17 | 2.0 | 6.78 |
| 21 | 2023-05-20 | 3.0 | 8.97 |
| 22 | 2023-01-21 | 1.0 | 3.29 |
| 23 | 2023-09-09 | 4.0 | 11.96 |
| 24 | 2023-04-15 | 3.0 | 9.77 |
| 25 | 2023-03-18 | 4.0 | 11.76 |
| 26 | 2023-03-04 | 12.0 | 35.88 |
| 27 | 2023-06-03 | 1.0 | 3.39 |
| 28 | 2023-04-29 | 3.0 | 10.17 |
| 29 | 2023-09-23 | 53.0 | 86.52 |
| 30 | 2023-10-28 | 83.0 | 138.98 |
| 31 | 2023-03-11 | 4.0 | 11.76 |
| 32 | 2023-02-18 | 3.0 | 8.97 |
| 33 | 2023-02-04 | 3.0 | 8.97 |
| 34 | 2023-02-11 | 3.0 | 8.97 |
| 35 | 2023-10-21 | 85.0 | 133.70 |
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extracting relevant features for forecasting (copy to avoid pandas' SettingWithCopyWarning)
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()
# Adding additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| | DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR |
|---|---|---|---|---|---|---|
| 0 | 2023-06-10 | 4.0 | 13.56 | 2023 | 6 | 23 |
| 1 | 2023-04-08 | 3.0 | 9.37 | 2023 | 4 | 14 |
| 2 | 2023-02-25 | 4.0 | 10.96 | 2023 | 2 | 8 |
| 3 | 2023-09-30 | 46.0 | 73.95 | 2023 | 9 | 39 |
| 4 | 2023-07-15 | 2.0 | 6.78 | 2023 | 7 | 28 |
| 5 | 2023-09-16 | 48.0 | 69.57 | 2023 | 9 | 37 |
| 6 | 2023-07-29 | 3.0 | 10.17 | 2023 | 7 | 30 |
| 7 | 2023-10-07 | 36.0 | 58.14 | 2023 | 10 | 40 |
| 8 | 2023-10-14 | 94.0 | 131.73 | 2023 | 10 | 41 |
| 9 | 2023-05-27 | 4.0 | 12.36 | 2023 | 5 | 21 |
| 10 | 2023-03-25 | 6.0 | 18.74 | 2023 | 3 | 12 |
| 11 | 2023-09-02 | 7.0 | 21.33 | 2023 | 9 | 35 |
| 12 | 2023-05-06 | 9.0 | 28.51 | 2023 | 5 | 18 |
| 13 | 2023-01-28 | 4.0 | 11.96 | 2023 | 1 | 4 |
| 14 | 2023-04-22 | 7.0 | 22.53 | 2023 | 4 | 16 |
| 15 | 2023-07-22 | 3.0 | 10.17 | 2023 | 7 | 29 |
| 16 | 2023-07-01 | 1.0 | 3.39 | 2023 | 7 | 26 |
| 17 | 2023-06-24 | 3.0 | 10.17 | 2023 | 6 | 25 |
| 18 | 2023-04-01 | 6.0 | 20.34 | 2023 | 4 | 13 |
| 19 | 2023-05-13 | 7.0 | 22.53 | 2023 | 5 | 19 |
| 20 | 2023-06-17 | 2.0 | 6.78 | 2023 | 6 | 24 |
| 21 | 2023-05-20 | 3.0 | 8.97 | 2023 | 5 | 20 |
| 22 | 2023-01-21 | 1.0 | 3.29 | 2023 | 1 | 3 |
| 23 | 2023-09-09 | 4.0 | 11.96 | 2023 | 9 | 36 |
| 24 | 2023-04-15 | 3.0 | 9.77 | 2023 | 4 | 15 |
| 25 | 2023-03-18 | 4.0 | 11.76 | 2023 | 3 | 11 |
| 26 | 2023-03-04 | 12.0 | 35.88 | 2023 | 3 | 9 |
| 27 | 2023-06-03 | 1.0 | 3.39 | 2023 | 6 | 22 |
| 28 | 2023-04-29 | 3.0 | 10.17 | 2023 | 4 | 17 |
| 29 | 2023-09-23 | 53.0 | 86.52 | 2023 | 9 | 38 |
| 30 | 2023-10-28 | 83.0 | 138.98 | 2023 | 10 | 43 |
| 31 | 2023-03-11 | 4.0 | 11.76 | 2023 | 3 | 10 |
| 32 | 2023-02-18 | 3.0 | 8.97 | 2023 | 2 | 7 |
| 33 | 2023-02-04 | 3.0 | 8.97 | 2023 | 2 | 5 |
| 34 | 2023-02-11 | 3.0 | 8.97 | 2023 | 2 | 6 |
| 35 | 2023-10-21 | 85.0 | 133.70 | 2023 | 10 | 42 |
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from prophet import Prophet
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Converting 'DATE' column to datetime format and setting it as index
results['DATE'] = pd.to_datetime(results['DATE'])
results.set_index('DATE', inplace=True)
# Forecasting for the next 52 weeks (1 year)
forecast_period = 52
# Prophet Model
df_prophet = forecast_features.reset_index().rename(columns={'DATE': 'ds', 'UNIT_SALES': 'y'})
prophet_model = Prophet(yearly_seasonality=True, weekly_seasonality=False, daily_seasonality=False)
prophet_model.fit(df_prophet)
future = prophet_model.make_future_dataframe(periods=forecast_period, freq='W')
prophet_forecast = prophet_model.predict(future)['yhat'].tail(forecast_period)
# Visualizing the forecasts
plt.figure(figsize=(15, 7))
plt.plot(future['ds'].tail(52), prophet_forecast, label='Prophet Forecast')
plt.legend()
plt.title('Prophet Forecasting')
plt.show()
Since there is not enough data for this combination, MAE and MSE scores cannot be evaluated reliably; we try another combination instead.
The Prophet forecast shows a sales peak in December 2023 followed by a rapid decline, which suggests sales are highest around the Christmas month and drop off later in the year.
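Rather than eyeballing the plot, the peak month can be confirmed by aggregating the forecast by calendar month. A minimal sketch; the `yhat` values below are illustrative placeholders, not the model's actual output:

```python
import pandas as pd

# Hypothetical weekly Prophet output; in the notebook this would be
# prophet_model.predict(future)[['ds', 'yhat']]
forecast = pd.DataFrame({
    'ds': pd.date_range('2023-11-05', periods=12, freq='W'),
    'yhat': [40, 45, 50, 70, 90, 110, 95, 60, 35, 30, 28, 25],
})

# Total forecasted units per calendar month, then pick the largest
monthly = forecast.groupby(forecast['ds'].dt.month_name())['yhat'].sum()
peak_month = monthly.idxmax()
print(peak_month, monthly.max())  # December 390
```

The same `groupby` on the real forecast would verify whether December genuinely dominates the forecasted year.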
We now filter on caloric segment 'Diet/Light', category 'Sparkling Water', and flavor 'Mulberries' for non-Swire manufacturers.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
job = client.get_job('bquxjob_68ccdfbc_18e97945b74') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT fmd.DATE,SUM(fmd.UNIT_SALES) AS UNIT_SALES, SUM(fmd.DOLLAR_SALES) AS DOLLAR_SALES
FROM `swirecc.fact_market_demand` fmd
JOIN (
SELECT DISTINCT zm.MARKET_KEY
FROM `swirecc.zip_to_market_unit_mapping` zm
LEFT JOIN `swirecc.consumer_demographics` cd
ON cd.Zip = zm.ZIP_CODE
WHERE cd.State NOT IN ('KS', 'UT', 'CA', 'CO', 'AZ', 'NM', 'NV')
) AS distinct_market_keys
ON fmd.MARKET_KEY = distinct_market_keys.MARKET_KEY
WHERE fmd.CALORIC_SEGMENT = 'DIET/LIGHT'
AND fmd.MANUFACTURER != 'SWIRE-CC'
AND ITEM LIKE '%MULBERRIES%'
AND fmd.CATEGORY = 'SPARKLING WATER'
GROUP BY DATE;
job = client.get_job('bquxjob_68ccdfbc_18e97945b74') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| DATE | UNIT_SALES | DOLLAR_SALES | |
|---|---|---|---|
| 0 | 2021-07-10 | 28246.0 | 84693.04 |
| 1 | 2021-01-09 | 25132.0 | 79652.82 |
| 2 | 2022-04-02 | 22797.0 | 74193.46 |
| 3 | 2021-10-02 | 23514.0 | 70526.81 |
| 4 | 2022-09-24 | 26249.0 | 87931.94 |
| ... | ... | ... | ... |
| 143 | 2022-11-12 | 19952.0 | 70432.09 |
| 144 | 2022-05-14 | 26992.0 | 86113.05 |
| 145 | 2021-02-20 | 23433.0 | 72967.56 |
| 146 | 2022-05-07 | 24156.0 | 78188.48 |
| 147 | 2022-10-22 | 21027.0 | 70427.62 |
148 rows × 3 columns
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extracting relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()  # .copy() avoids SettingWithCopyWarning when adding columns
# Adding additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR | |
|---|---|---|---|---|---|---|
| 0 | 2021-07-10 | 28246.0 | 84693.04 | 2021 | 7 | 27 |
| 1 | 2021-01-09 | 25132.0 | 79652.82 | 2021 | 1 | 1 |
| 2 | 2022-04-02 | 22797.0 | 74193.46 | 2022 | 4 | 13 |
| 3 | 2021-10-02 | 23514.0 | 70526.81 | 2021 | 10 | 39 |
| 4 | 2022-09-24 | 26249.0 | 87931.94 | 2022 | 9 | 38 |
| ... | ... | ... | ... | ... | ... | ... |
| 143 | 2022-11-12 | 19952.0 | 70432.09 | 2022 | 11 | 45 |
| 144 | 2022-05-14 | 26992.0 | 86113.05 | 2022 | 5 | 19 |
| 145 | 2021-02-20 | 23433.0 | 72967.56 | 2021 | 2 | 7 |
| 146 | 2022-05-07 | 24156.0 | 78188.48 | 2022 | 5 | 18 |
| 147 | 2022-10-22 | 21027.0 | 70427.62 | 2022 | 10 | 42 |
148 rows × 6 columns
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Converting 'DATE' to datetime format if necessary and sort
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.sort_values('DATE', inplace=True)
forecast_features.set_index('DATE', inplace=True)
# Define the model
exp_model = ExponentialSmoothing(
forecast_features['UNIT_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
# Forecast 52 periods into the future
exp_forecast = exp_model.forecast(52)
# Create a new DateTimeIndex for the forecast
last_date = forecast_features.index[-1]
forecast_index = pd.date_range(start=last_date + pd.Timedelta(days=1), periods=52, freq='W')
# Assign the new index to the forecast series
exp_forecast.index = forecast_index
# Plot the historical and forecasted data
plt.figure(figsize=(14, 7))
plt.plot(forecast_features.index, forecast_features['UNIT_SALES'], label='Historical UNIT_SALES', color='blue')
plt.plot(exp_forecast.index, exp_forecast, label='Forecasted UNIT_SALES', linestyle='--', color='orange')
plt.title('1-Year Forecast for UNIT_SALES')
plt.xlabel('Date')
plt.ylabel('UNIT_SALES')
plt.legend()
plt.show()
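As written, the fit triggers statsmodels warnings because the DatetimeIndex carries no explicit frequency, so forecasts fall back to an integer index. Declaring the weekly-Saturday cadence with `asfreq` avoids this; a minimal sketch with placeholder sales figures:

```python
import pandas as pd

# Placeholder weekly (Saturday) sales, deliberately unsorted like the query output
df = pd.DataFrame({
    'DATE': pd.to_datetime(['2021-01-16', '2021-01-02', '2021-01-09', '2021-01-23']),
    'UNIT_SALES': [24156.0, 25132.0, 23514.0, 26992.0],
})
df = df.sort_values('DATE').set_index('DATE')

# Declare the weekly-Saturday frequency so statsmodels can extend the index
df = df.asfreq('W-SAT')
print(df.index.freqstr)  # W-SAT
```

With the frequency declared, `ExponentialSmoothing(...).fit().forecast(52)` returns a date-indexed series directly, removing the need to rebuild `forecast_index` by hand.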
# Defining the Exponential Smoothing model for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
forecast_features['DOLLAR_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
# Forecast 52 periods into the future
exp_forecast_dollar = exp_model_dollar.forecast(52)
# The forecast index will be the same as for UNIT_SALES
exp_forecast_dollar.index = forecast_index
# Plotting the forecast for DOLLAR_SALES
plt.figure(figsize=(14, 7))
plt.plot(forecast_features.index, forecast_features['DOLLAR_SALES'], label='Historical DOLLAR_SALES', color='blue')
plt.plot(exp_forecast_dollar.index, exp_forecast_dollar, label='Forecasted DOLLAR_SALES', linestyle='--', color='orange')
plt.title('1-Year Forecast for DOLLAR_SALES')
plt.xlabel('Date')
plt.ylabel('DOLLAR_SALES')
plt.legend()
plt.show()
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Split the dataset
split_point = int(len(forecast_features) * 0.8)
train = forecast_features.iloc[:split_point]
test = forecast_features.iloc[split_point:]
# Fit the model on the training set
exp_model_dollar_train = ExponentialSmoothing(
train['DOLLAR_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
# Forecast on the test set period
dollar_sales_forecast = exp_model_dollar_train.forecast(len(test))
# Calculate MAE and MSE using the actual and forecasted values
mae_dollar = mean_absolute_error(test['DOLLAR_SALES'], dollar_sales_forecast)
mse_dollar = mean_squared_error(test['DOLLAR_SALES'], dollar_sales_forecast)
# Print out the metrics
print(f'DOLLAR_SALES - MAE: {mae_dollar}, MSE: {mse_dollar}')
DOLLAR_SALES - MAE: 12590.832913584723, MSE: 217388758.4560217
The model's MAE is 12,590.83 and its MSE is 217,388,758.
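An MSE in the hundreds of millions is hard to read on its own; taking its square root (RMSE) puts the error back in dollar units, directly comparable to the MAE. A quick sketch using the scores printed above:

```python
import math

# Validation scores from the Exponential Smoothing holdout above
mae_dollar = 12590.832913584723
mse_dollar = 217388758.4560217

# RMSE expresses the squared error in the original dollar units
rmse_dollar = math.sqrt(mse_dollar)
print(f'RMSE: {rmse_dollar:.2f}')  # roughly 14744 dollars per week
```

That the RMSE (≈14,744) sits close to the MAE (≈12,591) suggests the errors are fairly uniform, without a few extreme misses dominating the MSE.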
import pandas as pd
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# 'DATE' was already set as the index in an earlier cell; set it only if it is still a column
if 'DATE' in forecast_features.columns:
    forecast_features.set_index('DATE', inplace=True)
# Forward-fill any zero-sales weeks with the previous week's value
forecast_features['UNIT_SALES'] = forecast_features['UNIT_SALES'].replace(0, np.nan).ffill()
forecast_features['DOLLAR_SALES'] = forecast_features['DOLLAR_SALES'].replace(0, np.nan).ffill()
# Exponential Smoothing Forecast for UNIT_SALES
exp_model_unit = ExponentialSmoothing(
forecast_features['UNIT_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
# Exponential Smoothing Forecast for DOLLAR_SALES
exp_model_dollar = ExponentialSmoothing(
forecast_features['DOLLAR_SALES'],
trend='add',
seasonal='add',
seasonal_periods=52
).fit()
# Generating forecasts for the next 52 weeks
exp_forecast_unit = exp_model_unit.forecast(52)
exp_forecast_dollar = exp_model_dollar.forecast(52)
# Combine the forecasts into one DataFrame
forecast_df = pd.concat([exp_forecast_unit, exp_forecast_dollar], axis=1)
forecast_df.columns = ['UNIT_SALES_FORECAST', 'DOLLAR_SALES_FORECAST']
forecast_df.head(10) # Displaying the first 10 forecasted values
| UNIT_SALES_FORECAST | DOLLAR_SALES_FORECAST | |
|---|---|---|
| 148 | 27571.529482 | 88626.485788 |
| 149 | 22516.259348 | 72451.544244 |
| 150 | 19161.324780 | 72114.641730 |
| 151 | 23268.457092 | 86416.510320 |
| 152 | 21455.170171 | 74671.297507 |
| 153 | 18437.162185 | 67710.984217 |
| 154 | 22050.273942 | 71270.784056 |
| 155 | 21622.112713 | 83212.977492 |
| 156 | 19837.533935 | 68923.164014 |
| 157 | 27972.108054 | 93571.239394 |
Based on this index, the first forecasted week (the week following the last week in the dataframe) has unit sales of about 27,571 for non-Swire products and dollar sales of about $88,626.
The plot shows that unit and dollar sales peak in August. Since the competitors sell more during that month, it is advisable to increase production ahead of August.
Caloric Segment: Regular
Market Category: SSD
Manufacturer: Swire-CC
Brand: Sparkling Jacceptabletlester
Package Type: 11Small MLT
Flavor: ‘Avocado’
Swire plans to release this product 2 weeks prior to Easter and 2 weeks post Easter.
What will the forecasted demand be, in weeks, for this product?
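Easter moves each year, so the four-week launch window has to be recomputed per year. A minimal sketch using the anonymous Gregorian computus (Butcher's algorithm); `python-dateutil`'s `easter.easter()` gives the same result where that package is available:

```python
from datetime import date, timedelta

def easter_sunday(year: int) -> date:
    """Gregorian Easter Sunday via the anonymous (Butcher) computus."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)

easter_2024 = easter_sunday(2024)
# The release window Swire plans: 2 weeks before through 2 weeks after Easter
window = (easter_2024 - timedelta(weeks=2), easter_2024 + timedelta(weeks=2))
print(easter_2024, window)  # 2024-03-31 (2024-03-17 .. 2024-04-14)
```

This makes the holiday dates fed to Prophet below reproducible for any forecast year instead of being typed by hand.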
We first filter on caloric segment 'Regular', category 'SSD', and brand 'Sparkling Jacceptabletlester' for manufacturer 'Swire-CC'.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
job = client.get_job('bquxjob_c7a146e_18e97c1a8fd') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES FROM `swirecc.fact_market_demand` WHERE CALORIC_SEGMENT = 'REGULAR' AND MANUFACTURER = 'SWIRE-CC' AND CATEGORY = 'SSD' AND BRAND = 'SPARKLING JACCEPTABLETLESTER' GROUP BY DATE;
job = client.get_job('bquxjob_c7a146e_18e97c1a8fd') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| DATE | UNIT_SALES | DOLLAR_SALES | |
|---|---|---|---|
| 0 | 2021-01-23 | 70354.0 | 176293.12 |
| 1 | 2022-01-15 | 58533.0 | 149056.54 |
| 2 | 2022-01-22 | 56582.0 | 140421.99 |
| 3 | 2021-04-10 | 74685.0 | 188378.25 |
| 4 | 2021-06-19 | 82080.0 | 199918.61 |
| ... | ... | ... | ... |
| 142 | 2021-09-25 | 65465.0 | 166427.52 |
| 143 | 2023-07-01 | 55432.0 | 166191.44 |
| 144 | 2022-07-16 | 54396.0 | 152877.23 |
| 145 | 2022-02-05 | 56097.0 | 141844.16 |
| 146 | 2023-06-24 | 50273.0 | 155576.39 |
147 rows × 3 columns
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extracting relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']]
# Adding additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR | |
|---|---|---|---|---|---|---|
| 0 | 2021-01-23 | 70354.0 | 176293.12 | 2021 | 1 | 3 |
| 1 | 2022-01-15 | 58533.0 | 149056.54 | 2022 | 1 | 2 |
| 2 | 2022-01-22 | 56582.0 | 140421.99 | 2022 | 1 | 3 |
| 3 | 2021-04-10 | 74685.0 | 188378.25 | 2021 | 4 | 14 |
| 4 | 2021-06-19 | 82080.0 | 199918.61 | 2021 | 6 | 24 |
| ... | ... | ... | ... | ... | ... | ... |
| 142 | 2021-09-25 | 65465.0 | 166427.52 | 2021 | 9 | 38 |
| 143 | 2023-07-01 | 55432.0 | 166191.44 | 2023 | 7 | 26 |
| 144 | 2022-07-16 | 54396.0 | 152877.23 | 2022 | 7 | 28 |
| 145 | 2022-02-05 | 56097.0 | 141844.16 | 2022 | 2 | 5 |
| 146 | 2023-06-24 | 50273.0 | 155576.39 | 2023 | 6 | 25 |
147 rows × 6 columns
import pandas as pd
from prophet import Prophet
# Convert 'DATE' to datetime and ensure it's the index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Prepare the dataframe for Prophet's convention
prophet_df = forecast_features.reset_index().rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Define a holidays dataframe for Prophet, including Easter
# Easter moves each year, so we list the actual Easter Sunday dates
# (an annual 'A-APR' date_range would not track them)
easter_dates = pd.to_datetime([
    '2015-04-05', '2016-03-27', '2017-04-16', '2018-04-01', '2019-04-21',
    '2020-04-12', '2021-04-04', '2022-04-17', '2023-04-09', '2024-03-31',
    '2025-04-20',
])
easter_df = pd.DataFrame({
    'holiday': 'easter',
    'ds': easter_dates,
    'lower_window': -14,  # 2 weeks before
    'upper_window': 14,   # 2 weeks after
})
# Initialize the Prophet model with holidays
m = Prophet(holidays=easter_df)
# Fit the Prophet model
m.fit(prophet_df)
# Create a future dataframe for predictions
# Extend into the future by the number of weeks you want to forecast
future = m.make_future_dataframe(periods=52*2, freq='W')
# Predict the future with the model
forecast = m.predict(future)
# Filter the predictions to the period around Easter 2024
mask = (forecast['ds'] >= '2024-03-17') & (forecast['ds'] <= '2024-04-28')  # 2 weeks before Easter 2024 (March 31) through 4 weeks after
easter_forecast = forecast[mask]
# Plot the forecast
fig = m.plot(forecast)
plt.show()
# Print the forecasted values for the Easter period
print(easter_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']])
            ds           yhat     yhat_lower     yhat_upper
167 2024-03-17  155430.232001  143092.959012  168266.443701
168 2024-03-24  152026.638557  139249.817349  165322.131528
169 2024-03-31  151425.512991  138135.801454  164182.091479
170 2024-04-07  158150.231133  145743.115804  170100.352276
171 2024-04-14  166621.533457  154411.097177  180128.666088
172 2024-04-21  167096.278309  155286.590345  179625.531063
173 2024-04-28  157828.651397  145500.091259  170951.679379
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Calculate the split point
split_point = int(len(prophet_df) * 0.8)
# Split the data into training and test sets
train_df = prophet_df[:split_point]
test_df = prophet_df[split_point:]
# Initialize and fit the Prophet model on the training data
m = Prophet(holidays=easter_df)
m.fit(train_df)
# Create a dataframe for predictions that covers the test set period
future = m.make_future_dataframe(periods=len(test_df), freq='W')
# Predict on the future dataframe
forecast = m.predict(future)
# Filter out the predictions for the test set period
test_forecast = forecast[-len(test_df):]
# Calculate MAE and MSE using the test set
mae = mean_absolute_error(test_df['y'], test_forecast['yhat'])
mse = mean_squared_error(test_df['y'], test_forecast['yhat'])
print(f'MAE: {mae}')
print(f'MSE: {mse}')
MAE: 15035.1263937731
MSE: 415153436.4735005
The MAE and MSE obtained from the Prophet model are 15,035 and 415,153,436, respectively.
The Prophet model also provides upper and lower bounds for each weekly forecast. On March 17th, two weeks prior to Easter, forecasted dollar sales are about $155,430, dipping to about $152,027 the following week. After Easter (March 31), dollar sales rise to about $158,150 in the first week of April and continue upward to roughly $166,622 by mid-April.
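For production planning, the per-week forecasts around Easter can be summed into a total demand for the whole launch window. A sketch using the `yhat` values printed above (rounded to cents):

```python
# Weekly yhat values for 2024-03-17 through 2024-04-28, copied from the forecast above
easter_yhat = [155430.23, 152026.64, 151425.51, 158150.23,
               166621.53, 167096.28, 157828.65]

# Total forecasted dollar sales over the Easter launch window
total_window_sales = sum(easter_yhat)
print(f'Total forecasted dollar sales around Easter: ${total_window_sales:,.2f}')
```

The window total (about $1.11 million) is the figure a production planner would size capacity against, rather than any single week.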
We now filter on caloric segment 'Regular', category 'SSD', and flavor 'Avocado' for non-Swire manufacturers.
from google.colab import auth
from google.cloud import bigquery
from google.colab import data_table
project = 'spring-swire-ca' # Project ID inserted based on the query results selected to explore
location = 'US' # Location inserted based on the query results selected to explore
client = bigquery.Client(project=project, location=location)
data_table.enable_dataframe_formatter()
auth.authenticate_user()
job = client.get_job('bquxjob_55910a8e_18e97d03276') # Job ID inserted based on the query results selected to explore
print(job.query)
SELECT DATE,SUM(UNIT_SALES) AS UNIT_SALES, SUM(DOLLAR_SALES) AS DOLLAR_SALES FROM `swirecc.fact_market_demand` WHERE CALORIC_SEGMENT = 'REGULAR' AND MANUFACTURER != 'SWIRE-CC' AND ITEM LIKE '%AVOCADO%' AND CATEGORY = 'SSD' GROUP BY DATE;
job = client.get_job('bquxjob_55910a8e_18e97d03276') # Job ID inserted based on the query results selected to explore
results = job.to_dataframe()
results
| DATE | UNIT_SALES | DOLLAR_SALES | |
|---|---|---|---|
| 0 | 2023-06-17 | 1488999.00 | 5393496.05 |
| 1 | 2021-01-02 | 1598226.00 | 4258603.90 |
| 2 | 2021-06-05 | 1919708.00 | 5320678.00 |
| 3 | 2021-11-20 | 1598380.00 | 4779108.36 |
| 4 | 2021-02-27 | 1582854.00 | 4337313.37 |
| ... | ... | ... | ... |
| 142 | 2023-08-26 | 1391590.00 | 5075602.61 |
| 143 | 2023-06-03 | 1523339.00 | 5480932.18 |
| 144 | 2023-03-11 | 1381573.00 | 5116483.31 |
| 145 | 2022-11-26 | 1586792.00 | 5427628.42 |
| 146 | 2023-10-28 | 1281609.65 | 4656563.73 |
147 rows × 3 columns
# Converting 'DATE' column to datetime format
results['DATE'] = pd.to_datetime(results['DATE'])
# Extracting relevant features for forecasting
forecast_features = results[['DATE', 'UNIT_SALES', 'DOLLAR_SALES']].copy()  # .copy() avoids SettingWithCopyWarning when adding columns
# Adding additional time-related features
forecast_features['YEAR'] = forecast_features['DATE'].dt.year
forecast_features['MONTH'] = forecast_features['DATE'].dt.month
forecast_features['WEEK_OF_YEAR'] = forecast_features['DATE'].dt.isocalendar().week
# Displaying the prepared forecasting features
print("Forecasting Features:")
forecast_features
Forecasting Features:
| DATE | UNIT_SALES | DOLLAR_SALES | YEAR | MONTH | WEEK_OF_YEAR | |
|---|---|---|---|---|---|---|
| 0 | 2023-06-17 | 1488999.00 | 5393496.05 | 2023 | 6 | 24 |
| 1 | 2021-01-02 | 1598226.00 | 4258603.90 | 2021 | 1 | 53 |
| 2 | 2021-06-05 | 1919708.00 | 5320678.00 | 2021 | 6 | 22 |
| 3 | 2021-11-20 | 1598380.00 | 4779108.36 | 2021 | 11 | 46 |
| 4 | 2021-02-27 | 1582854.00 | 4337313.37 | 2021 | 2 | 8 |
| ... | ... | ... | ... | ... | ... | ... |
| 142 | 2023-08-26 | 1391590.00 | 5075602.61 | 2023 | 8 | 34 |
| 143 | 2023-06-03 | 1523339.00 | 5480932.18 | 2023 | 6 | 22 |
| 144 | 2023-03-11 | 1381573.00 | 5116483.31 | 2023 | 3 | 10 |
| 145 | 2022-11-26 | 1586792.00 | 5427628.42 | 2022 | 11 | 47 |
| 146 | 2023-10-28 | 1281609.65 | 4656563.73 | 2023 | 10 | 43 |
147 rows × 6 columns
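Note the WEEK_OF_YEAR of 53 for 2021-01-02 in the table above: `isocalendar()` assigns early-January dates to the last ISO week of the *previous* ISO year, so week numbers should be paired with the ISO year when grouping or plotting. A small check:

```python
import pandas as pd

# 2021-01-02 falls in ISO week 53 of ISO year 2020, not week 53 of 2021
iso_year, iso_week, _ = pd.Timestamp('2021-01-02').isocalendar()
print(iso_year, iso_week)  # 2020 53

# Safer grouping key: (ISO year, ISO week) rather than week number alone
dates = pd.Series(pd.to_datetime(['2021-01-02', '2021-01-09']))
iso_df = dates.dt.isocalendar()
pairs = list(zip(iso_df['year'], iso_df['week']))
print(pairs)
```

Grouping on the `(year, week)` pair keeps the first days of January from being lumped in with the following December's week 52/53.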
import pandas as pd
from prophet import Prophet
# Convert 'DATE' to datetime and ensure it's the index
forecast_features['DATE'] = pd.to_datetime(forecast_features['DATE'])
forecast_features.set_index('DATE', inplace=True)
# Prepare the dataframe for Prophet's convention
prophet_df = forecast_features.reset_index().rename(columns={'DATE': 'ds', 'DOLLAR_SALES': 'y'})
# Define a holidays dataframe for Prophet, including Easter
# Easter moves each year, so we list the actual Easter Sunday dates
# (an annual 'A-APR' date_range would not track them)
easter_dates = pd.to_datetime([
    '2015-04-05', '2016-03-27', '2017-04-16', '2018-04-01', '2019-04-21',
    '2020-04-12', '2021-04-04', '2022-04-17', '2023-04-09', '2024-03-31',
    '2025-04-20',
])
easter_df = pd.DataFrame({
    'holiday': 'easter',
    'ds': easter_dates,
    'lower_window': -14,  # 2 weeks before
    'upper_window': 14,   # 2 weeks after
})
# Initialize the Prophet model with holidays
m = Prophet(holidays=easter_df)
# Fit the Prophet model
m.fit(prophet_df)
# Create a future dataframe for predictions
# Extend into the future by the number of weeks you want to forecast
future = m.make_future_dataframe(periods=52*2, freq='W')
# Predict the future with the model
forecast = m.predict(future)
# Filter the predictions to the period around Easter 2024
mask = (forecast['ds'] >= '2024-03-17') & (forecast['ds'] <= '2024-04-28')  # 2 weeks before Easter 2024 (March 31) through 4 weeks after
easter_forecast = forecast[mask]
# Plot the forecast
fig = m.plot(forecast)
plt.show()
# Print the forecasted values for the Easter period
print(easter_forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']])
            ds          yhat    yhat_lower    yhat_upper
167 2024-03-17  4.682668e+06  4.429568e+06  4.939516e+06
168 2024-03-24  4.580196e+06  4.346854e+06  4.827261e+06
169 2024-03-31  4.609232e+06  4.373648e+06  4.862417e+06
170 2024-04-07  4.805525e+06  4.538350e+06  5.043012e+06
171 2024-04-14  5.003006e+06  4.749780e+06  5.255360e+06
172 2024-04-21  5.020446e+06  4.754876e+06  5.258223e+06
173 2024-04-28  4.875050e+06  4.617603e+06  5.140573e+06
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Calculate the split point
split_point = int(len(prophet_df) * 0.8)
# Split the data into training and test sets
train_df = prophet_df[:split_point]
test_df = prophet_df[split_point:]
# Initialize and fit the Prophet model on the training data
m = Prophet(holidays=easter_df)
m.fit(train_df)
# Create a dataframe for predictions that covers the test set period
future = m.make_future_dataframe(periods=len(test_df), freq='W')
# Predict on the future dataframe
forecast = m.predict(future)
# Filter out the predictions for the test set period
test_forecast = forecast[-len(test_df):]
# Calculate MAE and MSE using the test set
mae = mean_absolute_error(test_df['y'], test_forecast['yhat'])
mse = mean_squared_error(test_df['y'], test_forecast['yhat'])
print(f'MAE: {mae}')
print(f'MSE: {mse}')
MAE: 508154.048327975
MSE: 431836866470.70966
The MAE and MSE obtained from the Prophet model are 508,154 and 431,836,866,470, respectively.
The Prophet model also provides upper and lower bounds for each weekly forecast for the competitor brands with the 'Avocado' flavor. On March 17th, two weeks prior to Easter, forecasted dollar sales are about $4.68 million, dipping to $4.58 million the following week. After Easter (March 31, about $4.61 million), dollar sales rise to about $4.81 million in the first week of April and peak near $5.02 million in mid-to-late April.
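The two Prophet models' raw MAEs are not directly comparable because the series differ in scale by roughly 30x; normalizing each MAE by a typical weekly sales level puts them on common footing. A sketch using the scores above; the average weekly levels are rough assumptions eyeballed from the tables, not computed values:

```python
# MAE scores from the two Prophet validation runs above
swire_mae = 15035.13        # Swire SSD brand, weekly dollar sales on the order of $160k
competitor_mae = 508154.05  # competitor avocado SSD, weekly dollar sales on the order of $4.9M

# Assumed average weekly dollar sales (rough reads from the tables above)
swire_level = 160_000.0
competitor_level = 4_900_000.0

print(f'Swire relative error:      {swire_mae / swire_level:.1%}')
print(f'Competitor relative error: {competitor_mae / competitor_level:.1%}')
```

Both models land around a 9-10% relative error, suggesting comparable fit quality despite the very different absolute MAE magnitudes.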
The notebook's analysis leverages forecasting models to provide predictive insight into sales trends, focusing on the best-performing 13-week, 26-week, and 1-year horizons for the innovative products. The collaborative approach to modeling and analysis, combined with strategic insights drawn from the data, underscores the potential of data-driven decision-making for optimizing product sales and market positioning.
For questions involving strategic recommendations or business insights, the models emphasize the value of accurate sales forecasting in making informed decisions. The detailed analysis around key periods, coupled with the predictive performance of the models, provides a foundation for strategic planning, inventory management, and promotional activities aimed at maximizing sales and revenue.
Sai Eshwar Tadepalli - Prepared the notebook and table of contents; performed Prophet, ARIMA, SARIMA, and Exponential Smoothing modeling for innovative products 1, 2, 3, 6, and 7; reviewed the entire code and annotations; used Google BigQuery to bring the data into Colab; used Google Cloud Storage to store the data; used Tableau for EDA.
Abhiram Mannam - Performed model performance analysis using MAE and MSE; wrote detailed descriptions of the code and models; carried out the analysis and performance evaluation for innovative product 5; analyzed the model outputs in the results section; produced the total sales of the products in the best-performing weeks; filtered the datasets by combination in Python.
Kushal Ram Tayi - Wrote up the notebook; carried out the analysis and performance evaluation for innovative product 4; proofread the entire notebook; researched the best models for the forecasting series.